arxiv: 2508.05674 · v2 · submitted 2025-08-05 · 💻 cs.CR · cs.AI

Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

Pith reviewed 2026-05-19 01:27 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agents · offensive security · CTF benchmark · hyperparameter tuning · LLM as judge · cybersecurity · Capture the Flag · agent evaluation

The pith

Tuned hyperparameters and an LLM-as-judge framework give a concrete recipe for more effective offensive security agents on CTF tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the main factors that determine whether LLM agents succeed at offensive security work such as solving Capture the Flag challenges. It introduces CTFJudge to let a second LLM review full agent trajectories and score progress at each step, along with a new CTF Competency Index that measures how closely solutions match expert gold standards. The authors also test the effects of temperature, top-p, and maximum token length on planning quality and release a compact benchmark of fifty representative challenges. A sympathetic reader would care because clearer guidance on these choices could make automated security testing more reliable and easier to reproduce. The work ends by identifying multi-agent settings that improve results across binary exploitation, web, reverse engineering, forensics, and cryptography tasks.

Core claim

We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. This includes the CTFJudge framework leveraging LLM as a judge to analyze trajectories, the CTF Competency Index for partial correctness against human solutions, and examination of how hyperparameters influence performance and task planning, all demonstrated on the CTFTiny benchmark.

What carries the argument

CTFJudge, the framework that uses an LLM as a judge to deliver granular evaluation of agent trajectories across each step of CTF solving.

If this is right

  • Specific ranges for temperature, top-p, and maximum token length improve agent success and planning quality on cybersecurity tasks.
  • Multi-agent coordination settings can be chosen to raise overall performance when the recipe is followed.
  • The CTF Competency Index supplies a finer-grained score than binary pass/fail for comparing agent outputs to expert solutions.
  • CTFTiny supports quick iteration when testing new agent designs on representative challenges from five security categories.
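The hyperparameter study described above can be pictured as a grid sweep over temperature, top-p, and maximum token length. The following is a minimal sketch under stated assumptions, not the authors' harness: the grid values are placeholders rather than the paper's settings, and `run_agent` and `fake_run_agent` are hypothetical stand-ins for a real agent run that returns whether a challenge was solved.

```python
from itertools import product
from statistics import mean

# Hypothetical grid: placeholder values, not the settings reported in the paper.
GRID = {
    "temperature": [0.0, 0.5, 1.0],
    "top_p": [0.5, 0.9, 1.0],
    "max_tokens": [1024, 4096],
}


def sweep(run_agent, challenges, grid=GRID):
    """Evaluate every grid combination; return (config, solve_rate) pairs, best first."""
    keys = list(grid)
    results = []
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        # Solve rate = fraction of challenges the agent solves under this config.
        solve_rate = mean(
            1.0 if run_agent(ch, **config) else 0.0 for ch in challenges
        )
        results.append((config, solve_rate))
    results.sort(key=lambda r: r[1], reverse=True)
    return results


def fake_run_agent(challenge, temperature, top_p, max_tokens):
    """Hypothetical stand-in for a real agent run; a toy rule for demonstration only."""
    return temperature <= 0.5 and max_tokens >= 4096


if __name__ == "__main__":
    ranked = sweep(fake_run_agent, ["chal-a", "chal-b"])
    print(ranked[0])
```

In a real study the binary solve rate could be swapped for a partial-credit score, and each configuration would normally be repeated several times, since sampling with nonzero temperature is stochastic.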

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tuning and evaluation approach could shorten the time required to prototype LLM assistants that support human penetration testers.
  • Applying the judge framework to defensive security tasks or malware analysis might expose analogous performance drivers.
  • Public release of the fifty-challenge set invites direct head-to-head comparisons among different agent architectures.

Load-bearing premise

An LLM acting as a judge can deliver reliable, granular, and unbiased analysis of agent trajectories across CTF solving steps.

What would settle it

Human experts reviewing the same set of agent trajectories and producing ratings that differ substantially from CTFJudge scores on many challenges would undermine the evaluation approach.

Original abstract

Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework leveraging LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, CTF Competency Index (CCI) for partial correctness, revealing how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top-p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We make CTFTiny open source to public https://github.com/NYU-LLM-CTF/CTFTiny along with CTFJudge on https://github.com/NYU-LLM-CTF/CTFJudge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CTFJudge, an LLM-as-a-judge framework for granular analysis of agent trajectories on Capture the Flag challenges; proposes the CTF Competency Index (CCI) to quantify partial correctness relative to human gold standards; presents CTFTiny, a curated benchmark of 50 challenges spanning binary exploitation, web, reverse engineering, forensics, and cryptography; and reports experiments on the effects of hyperparameters (temperature, top-p, maximum token length) and multi-agent coordination on LLM agent performance in offensive security tasks, culminating in a recipe for effective agents. The artifacts are released as open source.

Significance. If the central claims hold, particularly the reliability of the automated judge, the work would supply a practical lightweight benchmark and evaluation protocol that could accelerate reproducible research on LLM agents for cybersecurity. The open-sourcing of CTFTiny and CTFJudge is a clear strength supporting community follow-up. The empirical focus on hyperparameter effects and multi-agent setups offers potentially actionable guidance, though its value depends on the soundness of the evaluation pipeline.

major comments (2)
  1. [CTFJudge framework and evaluation methodology] The reliability of CTFJudge is load-bearing for the CCI metric, the hyperparameter sweeps, and the multi-agent recipe. No section describes calibration of the judge LLM against human CTF experts (e.g., inter-rater agreement, Cohen's kappa, or Pearson correlation on step-level correctness or partial-exploit judgments). Systematic biases in scoring ambiguous trajectories could invert which temperature or top-p values appear optimal.
  2. [Results and discussion of hyperparameter tuning] The manuscript claims CTFJudge delivers 'reliable, granular, and unbiased analysis' yet provides no quantitative evidence (such as agreement rates on a held-out set of trajectories) that the judge's step-level scores align with expert consensus. This directly affects the trustworthiness of the reported optimal settings and the 'detailed recipe'.
minor comments (2)
  1. [CTFJudge implementation details] Clarify the exact prompt templates and few-shot examples used by CTFJudge so that the judging process can be replicated or improved by others.
  2. [CTFTiny benchmark description] Specify the selection criteria and diversity metrics used to curate the 50 challenges in CTFTiny to substantiate the claim of representativeness across categories.
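The calibration requested in the major comments could be reported with Cohen's kappa, which corrects raw agreement between two raters for agreement expected by chance. The sketch below is illustrative only: the label lists are made-up examples, not data from the paper, and the function name and the `"ok"`/`"fail"` step labels are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up step-level verdicts for six trajectory steps.
    judge = ["ok", "ok", "fail", "ok", "fail", "fail"]
    human = ["ok", "fail", "fail", "ok", "fail", "ok"]
    print(round(cohens_kappa(judge, human), 3))
```

Kappa near 1 indicates agreement well beyond chance, while values near 0 indicate agreement no better than chance. For ordinal or continuous step scores, a weighted kappa or a correlation coefficient would be the more natural choice.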

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We agree that the reliability of CTFJudge is central to our claims and will strengthen the paper by addressing the validation concerns. Below we provide point-by-point responses to the major comments.

  1. Referee: [CTFJudge framework and evaluation methodology] The reliability of CTFJudge is load-bearing for the CCI metric, the hyperparameter sweeps, and the multi-agent recipe. No section describes calibration of the judge LLM against human CTF experts (e.g., inter-rater agreement, Cohen's kappa, or Pearson correlation on step-level correctness or partial-exploit judgments). Systematic biases in scoring ambiguous trajectories could invert which temperature or top-p values appear optimal.

    Authors: We acknowledge that a formal calibration of CTFJudge against human CTF experts was not included in the original submission. This is a valid point, as such validation would further support the trustworthiness of the CCI metric and the reported hyperparameter effects. In the revised manuscript, we will add a dedicated section on judge calibration. Specifically, we will report inter-rater agreement metrics (e.g., Cohen's kappa) and correlation with expert judgments on a held-out set of agent trajectories. This addition will directly address potential concerns about systematic biases. revision: yes

  2. Referee: [Results and discussion of hyperparameter tuning] The manuscript claims CTFJudge delivers 'reliable, granular, and unbiased analysis' yet provides no quantitative evidence (such as agreement rates on a held-out set of trajectories) that the judge's step-level scores align with expert consensus. This directly affects the trustworthiness of the reported optimal settings and the 'detailed recipe'.

    Authors: We appreciate this observation. The description of CTFJudge as providing reliable analysis was intended to reflect its design for granular step-level evaluation, but we agree that quantitative evidence of alignment with human experts is necessary to substantiate this. We will revise the manuscript to include quantitative validation results, such as agreement rates on a held-out trajectory set, to support the claims about the judge's performance and the resulting optimal settings. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on newly introduced benchmark and judge framework

The paper introduces external artifacts (CTFTiny benchmark of 50 CTF challenges and CTFJudge LLM-as-judge framework) and reports experimental results on hyperparameter effects and multi-agent coordination. No equations, metrics, or derivations reduce by construction to fitted parameters or self-referential definitions. CCI is defined as alignment with human-crafted gold standards on the new benchmark, and findings are presented as direct observations from runs rather than predictions forced by prior fits. Self-citations, if present, are not load-bearing for the central recipe or CCI claims. The derivation chain is self-contained against the introduced evaluation artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Contributions rest on the domain assumption that LLMs can serve as competent judges of security-task trajectories and that the 50-challenge set is representative of broader CTF difficulty.

axioms (1)
  • domain assumption: An LLM acting as a judge can provide accurate and granular evaluation of agent trajectories in CTF tasks
    Core premise of the CTFJudge framework described in the abstract.
invented entities (3)
  • CTFJudge (no independent evidence)
    purpose: Framework that uses an LLM to judge and score agent trajectories
    New evaluation system introduced by the paper
  • CTF Competency Index (CCI) (no independent evidence)
    purpose: Metric for measuring partial correctness against human gold standards
    Novel scoring method proposed in the paper
  • CTFTiny (no independent evidence)
    purpose: Lightweight benchmark consisting of 50 curated CTF challenges
    New test set created for rapid evaluation




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autonomous Adversary: Red-Teaming in the age of LLM

    cs.CR 2026-05 unverdicted novelty 5.0

    Expert-defined action plans for LLM agents achieve higher task completion in lateral-movement scenarios than fully autonomous or self-scaffolded modes, but failures remain common due to brittle commands and state handling.

  2. RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

    cs.CR 2026-04 unverdicted novelty 4.0

    RAVEN combines LLM agents and RAG to generate Project Zero-style vulnerability reports, achieving 54.21% average quality on 105 NIST-SARD samples across 15 CWE types.