Measuring Safety Alignment Effects in Autonomous Security Agents
Pith reviewed 2026-05-20 04:21 UTC · model grok-4.3
The pith
Safety alignment effects in autonomous security agents must be measured at the system level by separating refusal, unsafe action, tool use, and evidence grounding rather than relying on refusal rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stock safety-aligned models and their uncensored or abliterated derivatives were run on 30 tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks. The Gemma less-restricted versions reached 14.0 percent and 10.7 percent success versus near-zero for the aligned pair, with higher mean grounding scores and zero refusal or unsafe-action rates, yet controls showed the same gaps on non-security coding tasks, Qwen2.5-Coder performed worse when less-restricted, and the abliterated Llama failed the tool protocol. Across families, difficult verification tasks remained unsolved. These patterns indicate that safety alignment effects appear in tool reliability and
What carries the argument
Trace-based benchmark of 30 vulnerability-analysis tasks that records full agent traces, applies deterministic success predicates, redaction rules, and five-point grounding checks to compare stock versus less-restricted model pairs.
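To make these moving parts concrete, here is a minimal sketch of what a recorded trace, a redaction rule, and a deterministic success predicate could look like. It is not the paper's implementation: the `Trace` and `ToolCall` types, the `SECRET_PATTERN` regex, and the predicate's criteria (file inspected, file named, weakness label named) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical redaction rule: mask values that follow secret-looking keys.
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|token|password)\s*[=:]\s*\S+")


@dataclass
class ToolCall:
    tool: str      # name of a tool from the fixed tool set (assumed field)
    argument: str  # e.g. the path passed to the tool
    output: str    # raw tool output recorded in the trace


@dataclass
class Trace:
    task_id: str
    calls: list = field(default_factory=list)
    final_report: str = ""


def redact(text: str) -> str:
    """Replace secret-looking values with a placeholder, keeping the key name."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


def success_predicate(trace: Trace, expected_file: str, expected_label: str) -> bool:
    """Deterministic check (assumed criteria): the agent inspected the target
    file through a tool call, and the redacted report names both the file and
    the expected weakness label."""
    report = redact(trace.final_report)
    inspected = any(expected_file in call.argument for call in trace.calls)
    return inspected and expected_file in report and expected_label in report
```

Because the predicate depends only on the recorded trace, re-scoring the same trace always gives the same verdict, which is the property the benchmark description relies on.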
If this is right
- Refusal rate alone does not capture safety alignment effects in autonomous agents.
- Less-restricted derivatives can raise task success and evidence quality on security work.
- The same performance gaps appear on ordinary coding tasks, so security-specific claims require controls.
- Hard proof-of-trigger and patch-verification tasks remain unsolved by current models.
- Tool-protocol failures in some less-restricted variants show that alignment removal can break agent reliability.
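The separation argued for in the points above can be sketched as a per-condition summary that reports each signal on its own instead of collapsing them into a refusal rate. The `TraceOutcome` fields and the `summarize` helper are hypothetical names, and the grounding score is assumed to be an integer out of five; the paper's actual schema may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TraceOutcome:
    refused: bool        # agent declined the task
    unsafe_action: bool  # agent took an action outside the authorized scope
    protocol_ok: bool    # tool calls followed the expected protocol
    success: bool        # deterministic success predicate passed
    grounding: int       # grounding score out of five (scale assumed)


def summarize(outcomes):
    """Report each signal separately for one model variant."""
    n = len(outcomes)
    return {
        "refusal_rate": sum(o.refused for o in outcomes) / n,
        "unsafe_action_rate": sum(o.unsafe_action for o in outcomes) / n,
        "protocol_failure_rate": sum(not o.protocol_ok for o in outcomes) / n,
        "success_rate": sum(o.success for o in outcomes) / n,
        "mean_grounding": mean(o.grounding for o in outcomes),
    }
```

A variant with a 0.0 refusal rate can still show a high protocol-failure rate or a low success rate in such a summary, which is the kind of divergence a refusal-only metric would hide.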
Where Pith is reading between the lines
- Evaluators should routinely add non-security control tasks to isolate whether observed effects are domain-specific.
- Benchmarks that allow more flexible tool selection could test whether the current fixed-tool setup understates or overstates real deployment risks.
- If unsafe-action rates stay near zero, less-restricted models may become the default choice for defensive security work.
Load-bearing premise
The 30 tasks with their fixed tools, deterministic success predicates, redaction rules, and grounding checks form a representative proxy for real-world autonomous security agent behavior without introducing evaluation biases that favor less-restricted models.
What would settle it
A study showing that refusal rates alone accurately predict rates of unsafe actions or poor evidence grounding across open-ended, real-world security agent deployments would undermine the need for system-level measurement.
Original abstract
Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.
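The success figures quoted in the abstract can be laid side by side to show why no universal less-restricted effect follows. The snippet below only does the subtraction; the dictionary layout and the `gaps` helper are illustrative, and Llama 3.1 8B is omitted because the abstract gives no success percentages for that pair.

```python
# Security-task success rates in percent, as quoted in the abstract.
success_pct = {
    "Gemma 4 31B": {"stock": 0.7, "less_restricted": 14.0},
    "Gemma 4 26B A4B": {"stock": 0.0, "less_restricted": 10.7},
    "Qwen2.5-Coder 7B": {"stock": 5.3, "less_restricted": 2.0},
}


def gaps(table):
    """Less-restricted minus stock success, in percentage points."""
    return {
        model: round(rates["less_restricted"] - rates["stock"], 1)
        for model, rates in table.items()
    }


print(gaps(success_pct))
# {'Gemma 4 31B': 13.3, 'Gemma 4 26B A4B': 10.7, 'Qwen2.5-Coder 7B': -3.3}
```

The sign flips for Qwen2.5-Coder, so the direction of the gap depends on the model family.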
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that single-turn refusal benchmarks are inadequate for evaluating safety alignment in autonomous security agents, which must perform multi-step tool use and evidence production. It introduces a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, then compares four stock models against their uncensored or abliterated derivatives using 1,500 security traces and 800 non-security controls. Results show large gains for less-restricted Gemma variants (e.g., 14.0% vs 0.7% success, higher grounding scores) but mixed or absent effects in other families and on controls, leading to the conclusion that safety must be measured at the system level by separating refusal, unsafe action, tool reliability, and evidence grounding rather than relying on refusal rate alone.
Significance. If the central empirical patterns hold after addressing potential benchmark artifacts, the work is significant for AI safety and security research. It supplies reproducible trace data and concrete metric divergences that demonstrate the insufficiency of refusal-only signals for agentic settings. Strengths include the use of non-security controls for triangulation, deterministic evaluation predicates, and the public artifact of 1,500 traces, which together provide a falsifiable, system-level evaluation framework.
major comments (2)
- [Section 3] Section 3 (Benchmark Design): The 30 tasks rely on fixed tools and deterministic success/grounding predicates that may embed output-style biases favoring less-restricted models (e.g., longer or less-censored traces satisfying redaction or evidence rules more readily). This is load-bearing for the central claim, because the reported divergences (Gemma 14.0% vs 0.7% success; grounding 3.91 vs 3.27) are interpreted as genuine alignment effects rather than proxy artifacts; an analysis correlating success with output length or verbosity across conditions is needed to rule this out.
- [Section 4] Section 4 (Results and Controls): While the paper notes mixed results across families (Qwen derivative lower success at 2.0% vs 5.3%) and unsolved hard tasks, the generalization to a general principle for system-level measurement would be strengthened by reporting per-task variance, statistical tests, or confidence intervals on the aggregate rates; without these, the triangulation from controls is suggestive but not yet conclusive for the claim that refusal rate is insufficient in general.
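The length analysis requested in the first major comment could be sketched as a point-biserial correlation, which is the Pearson correlation between a binary success flag and a continuous length measure. The function name and the toy inputs are hypothetical; the page does not say which statistic the authors actually used.

```python
from math import sqrt


def point_biserial(success_flags, lengths):
    """Pearson correlation between binary success (0/1) and output length.

    Returns None when either input has no variance, for example when a
    variant never succeeds, because the correlation is undefined there.
    """
    n = len(success_flags)
    mean_s = sum(success_flags) / n
    mean_l = sum(lengths) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(success_flags, lengths))
    var_s = sum((s - mean_s) ** 2 for s in success_flags)
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    if var_s == 0 or var_l == 0:
        return None
    return cov / sqrt(var_s * var_l)
```

A coefficient near zero within each condition would weaken the verbosity-artifact explanation, while a strong positive value would support the referee's concern. Note that a stock variant with 0.0% success yields an undefined coefficient, so the check is only informative for conditions that contain both outcomes.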
minor comments (2)
- [Abstract] Abstract and model descriptions: Standardize nomenclature (e.g., confirm whether 'Gemma 4 31B' refers to a specific Gemma-2 variant) to avoid reader confusion.
- [Figures/Tables] Figure and table captions: Ensure all reported scales (e.g., grounding out of five) are explicitly defined in captions for standalone readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment below and have revised the manuscript accordingly to strengthen the empirical support for our claims.
- Referee: [Section 3] Section 3 (Benchmark Design): The 30 tasks rely on fixed tools and deterministic success/grounding predicates that may embed output-style biases favoring less-restricted models (e.g., longer or less-censored traces satisfying redaction or evidence rules more readily). This is load-bearing for the central claim, because the reported divergences (Gemma 14.0% vs 0.7% success; grounding 3.91 vs 3.27) are interpreted as genuine alignment effects rather than proxy artifacts; an analysis correlating success with output length or verbosity across conditions is needed to rule this out.
  Authors: We agree that explicitly ruling out output-style biases strengthens the interpretation of the results as alignment effects. In the revised manuscript we have added a post-hoc analysis in Section 3 that correlates task success and grounding scores with output length (token count) and verbosity (sentence count) for every model variant and condition. The analysis is reported in a new table together with a brief discussion of how the deterministic predicates emphasize content (presence of specific vulnerability evidence after redaction) rather than stylistic features. We have also clarified the design rationale for the fixed tools and predicates.
  Revision: yes
- Referee: [Section 4] Section 4 (Results and Controls): While the paper notes mixed results across families (Qwen derivative lower success at 2.0% vs 5.3%) and unsolved hard tasks, the generalization to a general principle for system-level measurement would be strengthened by reporting per-task variance, statistical tests, or confidence intervals on the aggregate rates; without these, the triangulation from controls is suggestive but not yet conclusive for the claim that refusal rate is insufficient in general.
  Authors: We accept that additional statistical detail improves the strength of the generalization. The revised manuscript now includes a supplementary table of per-task success rates, bootstrap 95% confidence intervals for all aggregate metrics, and a short discussion of variance across the 30 tasks. We also note the limited power of formal tests given the task count while emphasizing that the mixed results across model families and the non-security controls already provide the triangulation supporting our system-level measurement recommendation.
  Revision: yes
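A bootstrap 95% confidence interval of the kind the authors mention could be computed along the following lines. This is a percentile-bootstrap sketch under stated assumptions: the function name, the resample count, the fixed seed, and the toy data (3 successes in 30 runs) are hypothetical, and the page does not describe the authors' exact procedure.

```python
import random


def bootstrap_success_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a success rate.

    `outcomes` is a list of 0/1 flags, one per trace. Traces are resampled
    with replacement and the empirical alpha/2 and 1 - alpha/2 quantiles of
    the resampled rates are returned.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical toy data: 3 successes in 30 runs.
toy_outcomes = [1] * 3 + [0] * 27
low, high = bootstrap_success_ci(toy_outcomes)
```

Resampling individual traces treats them as independent. With repeated runs over only 30 tasks, resampling whole tasks instead may give more honest, wider intervals; that is a design choice this sketch does not settle.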
Circularity Check
No circularity: direct empirical measurements on fixed tasks
The paper reports results from executing models on 30 fixed tasks with deterministic success predicates, redaction rules, and grounding checks, plus 800 control traces. All reported quantities (success rates such as 14.0% vs 0.7%, grounding scores such as 3.91 vs 3.27, refusal rates) are direct observations from these runs. The conclusion that safety alignment should be measured by separating refusal, unsafe action, tool reliability, and evidence grounding follows from comparing these observed divergences across model pairs and controls, without equations, fitted parameters renamed as predictions, self-citations for uniqueness theorems, or any reduction of outputs to inputs by construction. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be reliably run as tool-calling agents inside sandboxes with deterministic success predicates
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "safety alignment effects ... should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Artifact and Reproduction (paper Appendix A)
The artifact consists of the evaluation package security_agent_eval/, task catalogs in tasks/, model endpoint and GGUF-provenance configs in configs/, saved traces in runs/, generate...