Measuring Safety Alignment Effects in Autonomous Security Agents

Arthur Gervais; Isaac David

arxiv: 2605.19722 · v1 · pith:7V5YWVTUnew · submitted 2026-05-19 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-20 04:21 UTC · model grok-4.3

keywords: safety alignment, autonomous security agents, refusal benchmarks, vulnerability analysis, system-level evaluation, large language models, uncensored models, trace-based benchmark

The pith

Safety alignment effects in autonomous security agents must be measured at the system level by separating refusal, unsafe action, tool use, and evidence grounding rather than relying on refusal rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether safety-aligned language models and their less-restricted versions behave differently when deployed as autonomous security agents that inspect code repositories, call tools, and produce vulnerability evidence. Single-turn refusal benchmarks fall short because these agents must complete multi-step tasks inside authorized sandboxes. A trace-based evaluation of 30 local vulnerability-analysis tasks on four model families shows large success and grounding gains for some less-restricted Gemma variants, yet similar patterns appear on ordinary coding controls and reverse or vanish in other families. Hard proof-of-trigger and patch-verification tasks stay unsolved across the board. The central result is that safety alignment should be assessed by tracking the full system behavior instead of treating refusal rate as the sole signal.

Core claim

Stock safety-aligned models and their uncensored or abliterated derivatives were run on 30 tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks. The less-restricted Gemma versions reached 14.0 percent and 10.7 percent success versus near-zero for the aligned pair, with higher mean grounding scores and, in the 31B traces, zero refusal and unsafe-action rates. Yet controls showed similar gaps on non-security coding tasks, Qwen2.5-Coder performed worse when less-restricted, and the abliterated Llama failed the tool protocol. Across families, difficult verification tasks remained unsolved. These patterns indicate that safety alignment effects appear in tool reliability and evidence grounding, not only in refusal behavior.

What carries the argument

Trace-based benchmark of 30 vulnerability-analysis tasks that records full agent traces, applies deterministic success predicates, redaction rules, and five-point grounding checks to compare stock versus less-restricted model pairs.
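
As a rough illustration of what a trace record with a deterministic success predicate, a redaction rule, and a grounding check might look like, here is a minimal Python sketch. Everything in it is hypothetical and not taken from the paper's artifact: the `Trace` and `ToolCall` types, the secret-matching regex, the file-plus-CWE success rule, and the substring-based 0 to 5 grounding score are assumptions chosen only to make the idea concrete.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str    # hypothetical tool name, e.g. "read_file"
    output: str  # raw text the tool returned to the agent


@dataclass
class Trace:
    task_id: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_report: str = ""


# Hypothetical redaction rule: mask values assigned to secret-looking names.
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|token)\s*=\s*\S+")


def redact(text: str) -> str:
    """Replace secret-looking assignments with a fixed placeholder."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


def success(trace: Trace, expected_file: str, expected_cwe: str) -> bool:
    """Deterministic predicate (assumed): report names the file and the CWE ID."""
    report = trace.final_report
    return expected_file in report and expected_cwe in report


def grounding_score(trace: Trace, claims: list[str]) -> int:
    """Count (0 to 5) how many of the first five claims appear verbatim in tool output."""
    observed = "\n".join(call.output for call in trace.tool_calls)
    return sum(1 for claim in claims[:5] if claim in observed)


# Toy trace, invented for illustration only.
demo = Trace(
    task_id="demo-01",
    tool_calls=[
        ToolCall(
            "read_file",
            "query = 'SELECT * FROM users WHERE id=' + user_id\napi_key = abc123",
        )
    ],
    final_report="SQL injection (CWE-89) in app/db.py: query built by concatenation.",
)
demo_claims = ["'SELECT * FROM users WHERE id=' + user_id", "os.system("]
```

Because every check is a pure function of the recorded trace, rerunning the scorer on the same trace gives the same verdict, which is the property a deterministic predicate is meant to provide.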

If this is right

Refusal rate alone does not capture safety alignment effects in autonomous agents.
Less-restricted derivatives can raise task success and evidence quality on security work.
The same performance gaps appear on ordinary coding tasks, so security-specific claims require controls.
Hard proof-of-trigger and patch-verification tasks remain unsolved by current models.
Tool-protocol failures in some less-restricted variants show that alignment removal can break agent reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Evaluators should routinely add non-security control tasks to isolate whether observed effects are domain-specific.
Benchmarks that allow more flexible tool selection could test whether the current fixed-tool setup understates or overstates real deployment risks.
If unsafe-action rates stay near zero, less-restricted models may become the default choice for defensive security work.

Load-bearing premise

The 30 tasks with their fixed tools, deterministic success predicates, redaction rules, and grounding checks form a representative proxy for real-world autonomous security agent behavior without introducing evaluation biases that favor less-restricted models.

What would settle it

A study showing that refusal rates alone accurately predict rates of unsafe actions or poor evidence grounding across open-ended, real-world security agent deployments would undermine the need for system-level measurement.

Figures

Figures reproduced from arXiv: 2605.19722 by Arthur Gervais, Isaac David.

**Figure 2.** Cross-pair security results. Success points include bootstrap 95% confidence intervals; ...

Original abstract

Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.
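
To make "separating refusal, unsafe action, tool reliability, and evidence grounding" concrete, the following Python sketch reports each signal as its own field instead of collapsing them into one score. The `TraceOutcome` fields and the four toy outcomes are illustrative assumptions, not the paper's schema or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TraceOutcome:
    succeeded: bool         # deterministic success predicate passed
    refused: bool           # agent declined the task
    unsafe_action: bool     # agent attempted an out-of-policy action
    tool_protocol_ok: bool  # agent emitted well-formed tool calls
    grounding: int          # evidence grounding score, assumed 0 to 5


def summarize(outcomes: list[TraceOutcome]) -> dict[str, float]:
    """Report each safety-relevant signal separately for one model variant."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no traces to summarize")
    return {
        "success_rate": sum(o.succeeded for o in outcomes) / n,
        "refusal_rate": sum(o.refused for o in outcomes) / n,
        "unsafe_action_rate": sum(o.unsafe_action for o in outcomes) / n,
        "tool_failure_rate": sum(not o.tool_protocol_ok for o in outcomes) / n,
        "mean_grounding": mean(o.grounding for o in outcomes),
    }


# Toy outcomes, invented for illustration only.
toy = [
    TraceOutcome(True, False, False, True, 4),
    TraceOutcome(False, False, False, True, 2),
    TraceOutcome(False, False, False, False, 0),
    TraceOutcome(False, True, False, True, 0),
]
summary = summarize(toy)
```

In this toy set the refusal rate and the tool-failure rate are both 0.25 but come from different traces, so a refusal-only metric would miss the trace that failed the tool protocol without refusing.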

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that single-turn refusal benchmarks are inadequate for evaluating safety alignment in autonomous security agents, which must perform multi-step tool use and evidence production. It introduces a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, then compares four stock models against their uncensored or abliterated derivatives using 1,500 security traces and 800 non-security controls. Results show large gains for less-restricted Gemma variants (e.g., 14.0% vs 0.7% success, higher grounding scores) but mixed or absent effects in other families and on controls, leading to the conclusion that safety must be measured at the system level by separating refusal, unsafe action, tool reliability, and evidence grounding rather than relying on refusal rate alone.

Significance. If the central empirical patterns hold after addressing potential benchmark artifacts, the work is significant for AI safety and security research. It supplies reproducible trace data and concrete metric divergences that demonstrate the insufficiency of refusal-only signals for agentic settings. Strengths include the use of non-security controls for triangulation, deterministic evaluation predicates, and the public artifact of 1,500 traces, which together provide a falsifiable, system-level evaluation framework.

major comments (2)

[Section 3] Section 3 (Benchmark Design): The 30 tasks rely on fixed tools and deterministic success/grounding predicates that may embed output-style biases favoring less-restricted models (e.g., longer or less-censored traces satisfying redaction or evidence rules more readily). This is load-bearing for the central claim, because the reported divergences (Gemma 14.0% vs 0.7% success; grounding 3.91 vs 3.27) are interpreted as genuine alignment effects rather than proxy artifacts; an analysis correlating success with output length or verbosity across conditions is needed to rule this out.
[Section 4] Section 4 (Results and Controls): While the paper notes mixed results across families (Qwen derivative lower success at 2.0% vs 5.3%) and unsolved hard tasks, the generalization to a general principle for system-level measurement would be strengthened by reporting per-task variance, statistical tests, or confidence intervals on the aggregate rates; without these, the triangulation from controls is suggestive but not yet conclusive for the claim that refusal rate is insufficient in general.
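
The length-versus-success check requested in the first major comment could be approximated as in the Python sketch below: a Pearson correlation between a binary success flag and output token count, which is equivalent to a point-biserial correlation. This is a sketch under assumptions; the token counts and outcomes are made-up toy values, and any analysis the paper reports may be set up differently.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration: output token counts and success flags.
token_counts = [100.0, 200.0, 300.0, 400.0]
succeeded = [0.0, 0.0, 1.0, 1.0]
r = pearson(token_counts, succeeded)
```

A strong positive `r` within a single model variant would suggest the predicates reward verbosity; a value near zero would weaken the output-style explanation. Note that the function raises when every trace fails, which matters for variants with 0.0% success.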

minor comments (2)

[Abstract] Abstract and model descriptions: Standardize nomenclature (e.g., confirm whether 'Gemma 4 31B' refers to a specific Gemma-2 variant) to avoid reader confusion.
[Figures/Tables] Figure and table captions: Ensure all reported scales (e.g., grounding out of five) are explicitly defined in captions for standalone readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and have revised the manuscript accordingly to strengthen the empirical support for our claims.

Referee: [Section 3] Section 3 (Benchmark Design): The 30 tasks rely on fixed tools and deterministic success/grounding predicates that may embed output-style biases favoring less-restricted models (e.g., longer or less-censored traces satisfying redaction or evidence rules more readily). This is load-bearing for the central claim, because the reported divergences (Gemma 14.0% vs 0.7% success; grounding 3.91 vs 3.27) are interpreted as genuine alignment effects rather than proxy artifacts; an analysis correlating success with output length or verbosity across conditions is needed to rule this out.

Authors: We agree that explicitly ruling out output-style biases strengthens the interpretation of the results as alignment effects. In the revised manuscript we have added a post-hoc analysis in Section 3 that correlates task success and grounding scores with output length (token count) and verbosity (sentence count) for every model variant and condition. The analysis is reported in a new table together with a brief discussion of how the deterministic predicates emphasize content (presence of specific vulnerability evidence after redaction) rather than stylistic features. We have also clarified the design rationale for the fixed tools and predicates.
revision: yes

Referee: [Section 4] Section 4 (Results and Controls): While the paper notes mixed results across families (Qwen derivative lower success at 2.0% vs 5.3%) and unsolved hard tasks, the generalization to a general principle for system-level measurement would be strengthened by reporting per-task variance, statistical tests, or confidence intervals on the aggregate rates; without these, the triangulation from controls is suggestive but not yet conclusive for the claim that refusal rate is insufficient in general.

Authors: We accept that additional statistical detail improves the strength of the generalization. The revised manuscript now includes a supplementary table of per-task success rates, bootstrap 95% confidence intervals for all aggregate metrics, and a short discussion of variance across the 30 tasks. We also note the limited power of formal tests given the task count while emphasizing that the mixed results across model families and the non-security controls already provide the triangulation supporting our system-level measurement recommendation.
revision: yes
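
A percentile bootstrap of the kind mentioned in the response above can be sketched in Python as follows. The resampling unit (individual traces rather than tasks), the number of resamples, the fixed seed, and the 14-of-100 toy outcome vector are assumptions for illustration; the paper's exact procedure is not described here.

```python
import random


def bootstrap_ci(
    outcomes: list[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for a success rate over 0/1 outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to resample")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample with replacement and record the success rate of each resample.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return rates[lo_idx], rates[hi_idx]


# Toy outcome vector, invented for illustration: 14 successes in 100 traces.
toy_outcomes = [1] * 14 + [0] * 86
low, high = bootstrap_ci(toy_outcomes)
```

If traces from the same task are correlated, resampling whole tasks instead of single traces would typically give wider and more honest intervals, so the choice of unit is worth stating alongside any reported interval.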

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on fixed tasks

The paper reports results from executing models on 30 fixed tasks with deterministic success predicates, redaction rules, and grounding checks, plus 800 control traces. All reported quantities (success rates such as 14.0% vs 0.7%, grounding scores such as 3.91 vs 3.27, refusal rates) are direct observations from these runs. The conclusion that safety alignment should be measured by separating refusal, unsafe action, tool reliability, and evidence grounding follows from comparing these observed divergences across model pairs and controls, without equations, fitted parameters renamed as predictions, self-citations for uniqueness theorems, or any reduction of outputs to inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen tasks and evaluation rules are valid proxies for security agent performance and that differences in traces reflect alignment effects rather than other model differences.

axioms (1)

domain assumption: LLMs can be reliably run as tool-calling agents inside sandboxes with deterministic success predicates.
The benchmark setup assumes models will follow the tool protocol and produce evaluable outputs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "safety alignment effects ... should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 8 internal anchors

[1]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems 30, 2017

work page 2017
[2]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[3]

Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Advances in Neural Information Processing Systems 33, 2020

work page 2020
[4]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a labora...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

work page 2022
[6]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

TruthfulQA: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Computa- tional Linguistics (V olume 1: Long Papers), 2022

work page 2022
[8]

Aligning AI with shared human values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. InInternational Conference on Learning Representations, 2021

work page 2021
[9]

Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

XSTest: A test suite for identifying exaggerated safety behaviours in large language models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), 2024

work page 2024
[11]

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. InProceedings of the 41st International Conference on Machine Learning, 2024. 10

work page 2024
[12]

Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. InAdvances in Neural Information Processing Systems 37, Datasets ...

work page 2024
[13]

A StrongREJECT for empty jailbreaks

Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks. InAdvances in Neural Information Processing Systems 37, Datasets and Benchmarks Track, 2024

work page 2024
[14]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Jailbroken: How does LLM safety training fail? InAdvances in Neural Information Processing Systems 36, 2023

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? InAdvances in Neural Information Processing Systems 36, 2023

work page 2023
[16]

CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models.arXiv preprint arXiv:2404.13161, 2024

Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models.arXiv preprint arXiv:2404.13161, 2024

work page arXiv 2024
[17]

CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models.arXiv preprint arXiv:2408.01605, 2024

Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, and Joshua Saxe. CYBERSECEV AL 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models.arXiv preprint arXiv:2408.01605, 2024

work page arXiv 2024
[18]

Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Haoxiang Yang, Aolin Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Kenny Oseleononmen, Dan Boneh, Daniel Ho, a...

work page 2025
[19]

NYU CTF bench: A scalable open-source benchmark dataset for evaluating LLMs in offensive security

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. NYU CTF bench: A scalable open-source benchmark dataset for evaluating LLMs in offensive security. InAdvances in Neural Information Processing S...

work page 2024
[20]

Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik R

Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik R. Narasimhan, Ramesh Karri, and Ofir Press. EnIGMA: Interactive tools substantially assist LM agents in finding security vulnera...

work page 2025
[21]

SEC-bench: Automated bench- marking of LLM agents on real-world software security tasks

Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. SEC-bench: Automated bench- marking of LLM agents on real-world software security tasks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[22]

AgentBench: Evaluating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. InInternational Conference on Learning Representations, 2024

work page 2024
[23]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023. 11

work page 2023
[24]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems 36, 2023

work page 2023
[25]

WebShop: Towards scalable real-world web interaction with grounded language agents

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems 35, 2022

work page 2022
[26]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024

work page 2024
[27]

InterCode: Standard- izing and benchmarking interactive coding with execution feedback

John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. InterCode: Standard- izing and benchmarking interactive coding with execution feedback. InAdvances in Neural Information Processing Systems 36, Datasets and Benchmarks Track, 2023

work page 2023
[28]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024

work page 2024
[29]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems 37, 2024

work page 2024
[30]

Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J. Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. InAdvances in Neural Information Pro...

work page 2024
[31]

Siegel, Nitya Nadgir, and Arvind Narayanan

Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter.arXiv preprint arXiv:2407.01502, 2024

work page arXiv 2024
[32]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InAdvances in Neural Information Processing Systems 37, 2024

work page 2024
[33]

Gemma 4: Byte for byte, the most capable open mod- els

Clement Farabet and Olivier Lacombe. Gemma 4: Byte for byte, the most capable open mod- els. https://blog.google/innovation-and-ai/technology/developers-tools/ gemma-4/, 2026. Accessed 2026-05-07

work page 2026
[34]

google/gemma-4-31b-it

Google DeepMind. google/gemma-4-31b-it. https://huggingface.co/google/ gemma-4-31B-it, 2026. Accessed 2026-05-07

work page 2026
[35]

google/gemma-4-26b-a4b-it

Google DeepMind. google/gemma-4-26b-a4b-it. https://huggingface.co/google/ gemma-4-26B-A4B-it, 2026. Accessed 2026-05-07

work page 2026
[36]

Gemma 4 uncensored

TrevorJS. Gemma 4 uncensored. https://huggingface.co/collections/TrevorJS/ gemma-4-uncensored, 2026. Accessed 2026-05-07

work page 2026
[37]

Trevorjs/gemma-4-31b-it-uncensored

TrevorJS. Trevorjs/gemma-4-31b-it-uncensored. https://huggingface.co/TrevorJS/ gemma-4-31B-it-uncensored, 2026. Accessed 2026-05-07

work page 2026
[38]

Trevorjs/gemma-4-26b-a4b-it-uncensored.https://huggingface.co/TrevorJS/ gemma-4-26B-A4B-it-uncensored, 2026

TrevorJS. Trevorjs/gemma-4-26b-a4b-it-uncensored.https://huggingface.co/TrevorJS/ gemma-4-26B-A4B-it-uncensored, 2026. Accessed 2026-05-07

work page 2026
[39]

unsloth/gemma-4-26b-a4b-it-gguf

Unsloth. unsloth/gemma-4-26b-a4b-it-gguf. https://huggingface.co/unsloth/ gemma-4-26B-A4B-it-GGUF, 2026. Accessed 2026-05-07

work page 2026
[40]

Trevorjs/gemma-4-26b-a4b-it-uncensored-gguf

TrevorJS. Trevorjs/gemma-4-26b-a4b-it-uncensored-gguf. https://huggingface.co/ TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF, 2026. Accessed 2026-05-07

work page 2026
[41]

Qwen/qwen2.5-coder-7b-instruct-gguf

Qwen. Qwen/qwen2.5-coder-7b-instruct-gguf. https://huggingface.co/Qwen/Qwen2. 5-Coder-7B-Instruct-GGUF, 2024. Accessed 2026-05-07. 12

work page 2024
[42]

bartowski/qwen2.5-coder-7b-instruct-abliterated-gguf

bartowski. bartowski/qwen2.5-coder-7b-instruct-abliterated-gguf. https://huggingface. co/bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF , 2024. Accessed 2026-05-07

work page 2024
[43]

bartowski/meta-llama-3.1-8b-instruct-gguf

bartowski. bartowski/meta-llama-3.1-8b-instruct-gguf. https://huggingface.co/ bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, 2024. Accessed 2026-05-07

work page 2024
[44]

bartowski/meta-llama-3.1-8b-instruct-abliterated-gguf

bartowski. bartowski/meta-llama-3.1-8b-instruct-abliterated-gguf. https://huggingface. co/bartowski/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF , 2024. Accessed 2026-05-07

work page 2024
[45]

Trevorjs/gemma-4-31b-it-uncensored-gguf

TrevorJS. Trevorjs/gemma-4-31b-it-uncensored-gguf. https://huggingface.co/ TrevorJS/gemma-4-31B-it-uncensored-GGUF, 2026. Accessed 2026-05-07

work page 2026
[46]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

The GEM benchmark: Natural language generation, its evaluation and metrics

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ond ˇrej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Y...

work page 2021
[48]

Inspect AI: Framework for large language model evaluations

UK AI Safety Institute. Inspect AI: Framework for large language model evaluations. https: //inspect.aisi.org.uk/, 2024. Accessed 2026-05-07

work page 2024
[49]

Evaluating frontier models for dangerous capabilities,

Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca ...

work page arXiv 2024
[50]

Model evaluation for extreme risks

Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe. Model evaluation for extreme risks.arXiv prepr...

arXiv, 2023
[51]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stef...

arXiv, 2021
[52]

Release Strategies and the Social Impacts of Language Models

Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason Blazakis, Kris McGuffie, and Jasmine Wang. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019

[53]

Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools

Jonas B. Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. arXiv preprint arXiv:2306.13952, 2023

[54]

Common weakness enumeration

MITRE Corporation. Common weakness enumeration. https://cwe.mitre.org/, 2024. Accessed 2026-05-07

[55]

OWASP Top 10: The ten most critical web application security risks

OWASP Foundation. OWASP Top 10: The ten most critical web application security risks. https://owasp.org/www-project-top-ten/, 2021. Accessed 2026-05-07

[56]

llama.cpp: LLM inference in C/C++

Georgi Gerganov and contributors. llama.cpp: LLM inference in C/C++. https://github.com/ggerganov/llama.cpp, 2023. Accessed 2026-05-07

[57]

The Hugging Face Hub: Machine learning collaboration platform

Hugging Face. The Hugging Face Hub: Machine learning collaboration platform. https://huggingface.co/docs/hub/index, 2024. Accessed 2026-05-07

[58]

JSON Schema draft 2020-12

JSON Schema Organization. JSON Schema draft 2020-12. https://json-schema.org/draft/2020-12/json-schema-core.html, 2020. Accessed 2026-05-07

A Artifact and Reproduction

The artifact consists of the evaluation package security_agent_eval/, task catalogs in tasks/, model endpoint and GGUF-provenance configs in configs/, saved traces in runs/, generate...
