arxiv: 2605.29115 · v1 · pith:6OZKL7LJnew · submitted 2026-05-27 · 💻 cs.CR · cs.AI

unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning

Pith reviewed 2026-06-29 11:16 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords unix competence · shell agents · capture the flag · procedural generation · reinforcement learning · language model training · terminal benchmarks · forensics

The pith

A procedural generator of Unix CTF tasks isolates shell skills and raises fine-tuned model solve rates from 11.6% to 43.6% on holdouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds unix-ctf, a generator that produces many capture-the-flag tasks, each hiding a flag inside a container using a single Unix feature. An LLM drafts paired hide and find scripts, and a pair is kept only if the hide script leaves no plaintext trace of the flag and the find script recovers it. Training a language model on these tasks raises its solve rate on a held-out set of skills and shifts which InterCode-CTF tasks it solves, with the largest gain in Forensics. The authors take this as evidence that Unix competence is a distinct skill that can be trained directly.

Core claim

Unix competence is the ability to use shell and operating-system primitives as first-class tools. unix-ctf produces tasks by an LLM-assisted synthesis pipeline that generates candidate hiding techniques, rewrites them into parameterized hide-and-find script pairs, and filters them with a bidirectional contract. The pipeline yields 656 valid variants from 750 attempts. Fine-tuning Qwen3-8B with LoRA using GRPO on this surface lifts solve rate from 11.6% to 43.6% on a 15-skill multi-family holdout (n=225), redistributes which InterCode-CTF tasks the model solves, and produces a +33 pp gain in Forensics while reaching 32/100 on InterCode-CTF.
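The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes the solve rates correspond to 26 and 98 solved tasks out of 225; those counts round to the reported 11.6% and 43.6% but are not stated in the source. The Wilson interval is the one named in the Figure 4 caption, and the function and variable names are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Pipeline yield: 656 portable variants out of 750 raw attempts (from the source).
pipeline_yield = 656 / 750

# ASSUMED counts: chosen so the rates round to the reported 11.6% and 43.6%.
base_solved, tuned_solved, n_tasks = 26, 98, 225
base_rate = base_solved / n_tasks
tuned_rate = tuned_solved / n_tasks

absolute_gain_pp = (tuned_rate - base_rate) * 100  # percentage points
relative_gain = tuned_rate / base_rate             # multiplicative lift

print(f"yield: {pipeline_yield:.1%}")
print(f"base:  {base_rate:.1%}  Wilson 95%: {wilson_interval(base_solved, n_tasks)}")
print(f"tuned: {tuned_rate:.1%}  Wilson 95%: {wilson_interval(tuned_solved, n_tasks)}")
print(f"gain:  {absolute_gain_pp:.1f} pp, {relative_gain:.1f}x")
```

Under these assumed counts the two intervals do not overlap, which is consistent with the caption's description of a substantial lift over base.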

What carries the argument

The bidirectional contract in the LLM-assisted synthesis pipeline, which requires hide scripts to leave no plaintext trace of the flag and find scripts to recover it in a fresh directory.
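A minimal sketch of what such a contract check might look like, written in Python for illustration. The paper's scripts are shell scripts run inside a container; here a base64-encoded file stands in as a toy hiding technique, and every name (`toy_hide`, `toy_find`, `leaky_hide`, `passes_contract`, `notes.dat`) is hypothetical rather than taken from the paper.

```python
import base64
import tempfile
from pathlib import Path

FLAG = "flag(a3b1c9)"  # hypothetical flag in the format the abstract describes


def toy_hide(target_dir: Path, flag: str) -> None:
    """Toy stand-in for a hide script: store the flag base64-encoded."""
    (target_dir / "notes.dat").write_bytes(base64.b64encode(flag.encode()))


def toy_find(target_dir: Path) -> str:
    """Toy stand-in for the matching find script: decode the stored flag."""
    return base64.b64decode((target_dir / "notes.dat").read_bytes()).decode()


def leaky_hide(target_dir: Path, flag: str) -> None:
    """A hide step that violates the contract by writing the flag verbatim."""
    (target_dir / "notes.dat").write_text(flag)


def leaves_plaintext(target_dir: Path, flag: str) -> bool:
    """Return True if the flag appears verbatim in any file name or file body."""
    needle = flag.encode()
    for path in target_dir.rglob("*"):
        if needle in path.name.encode():
            return True
        if path.is_file() and needle in path.read_bytes():
            return True
    return False


def passes_contract(hide, find, flag: str) -> bool:
    """Run hide in a fresh directory, then require no leak and exact recovery."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp)
        hide(target, flag)
        if leaves_plaintext(target, flag):
            return False  # direction 1 failed: plaintext trace on disk
        return find(target) == flag  # direction 2: the flag must be recoverable


print(passes_contract(toy_hide, toy_find, FLAG))
print(passes_contract(leaky_hide, toy_find, FLAG))
```

Checking both directions matters: a hide step that leaks the flag makes the task trivially solvable with a recursive search, while a find step that cannot recover the flag makes the task unsolvable.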

If this is right

  • The pipeline yields valid tasks at a high rate: 87.5% of raw attempts (656 of 750) pass the contract checks.
  • Fine-tuning redistributes success across InterCode-CTF tasks.
  • The approach yields a 33 percentage point gain in the Forensics category.
  • Overall performance reaches 32 out of 100 on InterCode-CTF after training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might allow creation of similar training sets for other specific computing skills.
  • Future benchmarks could separate Unix competence from programming skills to better measure progress.
  • Agents trained this way could handle real shell tasks more reliably without needing full programming knowledge.

Load-bearing premise

The tasks generated by the pipeline test genuine Unix competence, not regularities introduced by the LLM synthesis or the fixed container layout.

What would settle it

A test showing that models trained on unix-ctf perform no better than untrained models on a collection of Unix tasks created without using the generator would falsify the separability and trainability claim.

Figures

Figures reproduced from arXiv: 2605.29115 by Augustine N. Mavor-Parker, Geoffrey Bradway, Lorenz Wolf, Matthew James Sargent, Maxwill Lin, Roger Creus Castanyer.

Figure 1: The unix-ctf pipeline. Harvest (left, run offline once) builds a canonicalized library of 155 techniques and 441 variants through a five-stage cycle that alternates frontier-model exploration, mechanical verification, smaller-model synthesis, portability validation, and dedup-based canonicalization. Env-gen (centre, run every training batch) draws a sticky pool of 50 techniques from the library, samples ei…
Figure 2: What terminal benchmarks actually measure. Each row: share of tasks classified by gpt-5.4-nano (full rubric, Appendix M). Unix-competence: non-obvious OS-/format-internal feature with no clean Python analogue. Shell-flavored coding: glue using find/grep/awk/sed/sqlite3/tar (Python solve scripts wrapping such glue count here, not as general programming). General programming: non-trivial programs (ML, interp…
Figure 3: The technique library. (a) Coverage of the Unix-competence surface across 16 post-hoc families. The largest families (configuration files, text-processing tools, text encoding/ciphers) are general-purpose surfaces; the smaller families (binary/ELF, X.509, IPC primitives) are exactly the corners that distinguish Unix fluency from coding. (b) Production funnel from 750 raw harvest attempts to 656 portable sc…
Figure 4: Multi-family holdout, n=225 single-flag environments per variant, 18-turn budget. (a) Aggregate solve rate with Wilson 95% intervals. Both GRPO initializations lift substantially over base; GRPO from an SFT-initialized checkpoint reaches 43.6%, a 3.8× gain over base. (b) Mean turns per episode (max 18). The SFT-initialized variant uses the budget more efficiently.
Figure 5: InterCode-CTF capability redistribution. (a) Each trained variant’s IC-CTF-100 results decomposed against base into held (both solve), lost (base only), and gained (trained only); the dotted line marks the base solve rate. (b) Per-category solve rate.
Figure 6: Per-batch GRPO training reward over forty batches. Solid blue: GRPO-from-SFT (peaks at +27.0). Dashed red: GRPO-from-base (peaks at +1.45). Despite the training-reward gap, GRPO-from-base reaches 32% on InterCode-CTF and GRPO-from-SFT only 23%; the variant ranking inverts. GRPO-from-SFT climbs to roughly 2.4 flags per rollout while GRPO-from-base barely escapes zero …
Original abstract

Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver fluent in Python but weak in Unix can pass a substantial fraction of Terminal-Bench 2.0, while the reverse skill profile is rarely exercised. We make the distinction operational and build a training surface for the Unix component. unix-ctf is a procedural generator of capture-the-flag tasks for shell agents. Each task hides a short token (a flag of the form flag(a3b1c9...)) inside a fresh Linux container using a single Unix feature, and the agent must recover it. Tasks are produced by an LLM-assisted synthesis pipeline that generates candidate hiding techniques, rewrites them into parameterized hide-and-find script pairs, and filters them with a bidirectional contract: the hide script must leave no plaintext trace of the flag on disk, and the find script must recover the flag in a fresh directory. Because the LLM only writes the planting and recovery steps (the container, layout, and grading harness are fixed), the pipeline lands 656 of 750 raw attempts as portable, reusable variants (87.5%). Our reproduction of Endless Terminals' full-container-generation approach lands only 17.4% under the same checks. The 656 variants canonicalize to 155 distinct techniques. Fine-tuning Qwen3-8B with LoRA using GRPO on this surface lifts solve rate from 11.6% to 43.6% on a 15-skill multi-family holdout (n=225), redistributes which InterCode-CTF tasks the model solves, and produces a +33 pp gain in Forensics while reaching 32/100 on InterCode-CTF. These results suggest that Unix competence is separable, trainable, and best evaluated directly rather than folded into programming-through-a-shell.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces unix-ctf, a procedural generator of Unix CTF tasks for training shell agents on Unix competence. Tasks hide a flag using a single Unix feature inside a fixed Linux container via an LLM-assisted pipeline that produces hide-and-find script pairs, filtered by a bidirectional contract (no plaintext flag trace after hiding; full recoverability by the find script). The pipeline achieves 87.5% yield (656/750) versus 17.4% for a reproduced full-container baseline, yielding 155 distinct techniques. Fine-tuning Qwen3-8B with LoRA + GRPO raises solve rate from 11.6% to 43.6% on a 15-skill multi-family holdout (n=225), redistributes solved InterCode-CTF tasks, delivers a +33 pp gain in Forensics, and reaches 32/100 on InterCode-CTF. The work concludes that Unix competence is separable, trainable, and best evaluated directly.

Significance. If the reported gains reflect acquisition of genuine, separable Unix skills, the work supplies a high-yield, reproducible procedural training surface for RL on terminal agents that isolates Unix primitives from general programming. The fixed harness, bidirectional contract, and direct comparison to prior generation methods are concrete strengths that support reproducibility and controlled experimentation.

major comments (2)
  1. [Abstract / Results] The reported performance figures (11.6% to 43.6% on the 15-skill holdout, +33 pp Forensics gain, 32/100 on InterCode-CTF) are presented without error bars, holdout-construction details, statistical tests, or a description of how the 15 skill families were selected and balanced. These omissions make the central empirical claims difficult to evaluate for reliability.
  2. [Task Generation Pipeline] The bidirectional contract only enforces absence of plaintext flag traces and recoverability; it does not constrain statistical regularities in the 155 techniques, directory-naming conventions, permission patterns, or command sequences produced by the fixed container/layout. Because training and the internal holdout are sampled from the same generator, shared artifacts could produce the observed lift and InterCode-CTF redistribution without the model having acquired transferable Unix competence. A targeted test (e.g., distribution-shifted holdout or adversarial variants) is needed to support the separability claim.
minor comments (2)
  1. [Abstract] The canonicalization step that reduces 656 variants to 155 distinct techniques would benefit from an explicit description of the deduplication criteria.
  2. [Evaluation] Clarify whether the 15 skill families were chosen to ensure balanced coverage or based on other criteria.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing evaluation rigor and the need to rule out generator-specific artifacts. We address both major comments below and will incorporate revisions to strengthen the empirical claims.

Point-by-point responses
  1. Referee: [Abstract / Results] The reported performance figures (11.6% to 43.6% on the 15-skill holdout, +33 pp Forensics gain, 32/100 on InterCode-CTF) are presented without error bars, holdout-construction details, statistical tests, or a description of how the 15 skill families were selected and balanced. These omissions make the central empirical claims difficult to evaluate for reliability.

    Authors: We agree that the absence of error bars, statistical tests, and explicit holdout-construction details weakens the presentation of the central results. In the revised manuscript we will add: (1) bootstrap 95% confidence intervals computed over 1000 resamples for all solve-rate figures; (2) a dedicated subsection detailing how the 15 skill families were chosen (maximal coverage of Unix primitives while ensuring no overlap with training families) and balanced by task count; and (3) results of McNemar’s test for paired pre-/post-fine-tuning comparisons on the holdout. These additions will make the reliability of the reported gains directly assessable. revision: yes

  2. Referee: [Task Generation Pipeline] The bidirectional contract only enforces absence of plaintext flag traces and recoverability; it does not constrain statistical regularities in the 155 techniques, directory-naming conventions, permission patterns, or command sequences produced by the fixed container/layout. Because training and the internal holdout are sampled from the same generator, shared artifacts could produce the observed lift and InterCode-CTF redistribution without the model having acquired transferable Unix competence. A targeted test (e.g., distribution-shifted holdout or adversarial variants) is needed to support the separability claim.

    Authors: The referee correctly notes that the bidirectional contract does not eliminate the possibility of low-level distributional artifacts shared between training and the internal holdout. While the 15-skill holdout already uses disjoint technique families and the external InterCode-CTF results show redistribution and a +33 pp Forensics gain on a benchmark with different task distributions, these do not fully substitute for an explicit distribution-shift experiment. In revision we will add an explicit limitations paragraph acknowledging this gap and, resources permitting, report a small-scale distribution-shifted probe (e.g., tasks with altered directory-naming conventions or permission patterns) to quantify sensitivity to generator regularities. We believe the current InterCode-CTF transfer evidence still supports separability but will present it more cautiously. revision: partial
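The interval and paired test promised in the first response can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: the 98-of-225 count is inferred from the reported 43.6%, and the split of discordant pairs into 80 gained and 8 lost is invented purely to exercise the function (only their difference of 72 follows from the assumed base and tuned counts of 26 and 98). Function names are hypothetical.

```python
import math
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's solve rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int(alpha / 2 * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mcnemar_exact(gained, lost):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = gained + lost
    if n == 0:
        return 1.0
    k = min(gained, lost)
    # Binomial(n, 0.5) tail probability of a split at least this lopsided.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# ASSUMPTION: 98 of 225 holdout tasks solved, which rounds to the reported 43.6%.
tuned_outcomes = [1] * 98 + [0] * 127
ci_low, ci_high = bootstrap_ci(tuned_outcomes)

# HYPOTHETICAL discordant counts: tasks solved only after vs only before training.
p_value = mcnemar_exact(gained=80, lost=8)

print(f"bootstrap 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"exact McNemar p-value (hypothetical split): {p_value:.3g}")
```

McNemar's test uses only the tasks whose outcome changed between the base and fine-tuned model, which is why the held/lost/gained decomposition in Figure 5 is the natural input for it.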

Circularity Check

0 steps flagged

No significant circularity; empirical measurements on procedurally generated tasks

full rationale

The paper describes an empirical pipeline: LLM-assisted synthesis of hide/find script pairs under a bidirectional contract, canonicalization to 155 techniques, and RL fine-tuning (GRPO + LoRA) whose outcomes are measured as solve-rate lifts on a held-out 15-skill set and on InterCode-CTF. No equations, fitted parameters renamed as predictions, or derivation steps appear. The 87.5% yield figure is a direct count of pipeline success, not a self-referential claim. Self-citation is absent; the comparison to Endless Terminals is external reproduction. Claims rest on observed performance deltas rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no equations or explicit parameters are stated. The pipeline implicitly assumes the LLM can generate valid hide/find pairs and that the fixed container harness does not leak information.

axioms (1)
  • domain assumption: The LLM-assisted synthesis produces portable, reusable variants that satisfy the bidirectional contract without introducing systematic bias in task difficulty or coverage.
    Invoked when claiming 656 valid tasks and downstream training gains; no independent verification of the contract is described.


