arxiv: 2602.04165 · v2 · pith:NZPPKNG4new · submitted 2026-02-04 · 💻 cs.SE

PoC-Gym: Towards More Reliable LLM-Assisted Proof-of-Concept Exploit Generation

Pith reviewed 2026-05-16 08:01 UTC · model grok-4.3

classification 💻 cs.SE
keywords PoC generation · LLM-assisted security · Java CVEs · exploit generation · static analysis · dynamic validation · vulnerability reproduction

The pith

PoC-Gym generates post-hoc valid PoCs for 12 of 20 Java CVEs by requiring candidates to reach ground-truth vulnerable locations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PoC-Gym, an iterative pipeline that feeds LLMs CVE-specific prompts, static traces, and coverage feedback to produce PoC candidates for Java security vulnerabilities. Each candidate must pass three runtime checks (complete execution, a manifest success signal, and arrival at the sink of the target trace) before it counts as runtime-valid. Across 338 runs on 20 CVEs, 116 candidates pass runtime validation and 65 of them also pass post-hoc validation against the ground-truth vulnerable locations, covering 12 vulnerabilities. On the 14-CVE subset shared with FaultLine, the strongest PoC-Gym configuration is post-hoc valid for 8 CVEs, while FaultLine reports success for 5 under its own evaluation criterion. The evaluation also surfaces 51 runtime-valid but post-hoc-invalid PoCs, which the authors analyze to identify recurring failure patterns.
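The three runtime checks can be read as an ordered filter. The sketch below is only an illustration under stated assumptions: the `RunResult` record, its field names, the exit-code test for "complete execution", and the `[VULN]` marker string are hypothetical stand-ins, not PoC-Gym's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical record of one PoC execution (field names are illustrative)."""
    exit_code: int
    timed_out: bool
    stdout: str
    reached_locations: set = field(default_factory=set)


def runtime_valid(run, sink, marker="[VULN]"):
    """Apply the three checks in order and return (verdict, reason)."""
    # Check 1: the execution completed (assumed here: no timeout, exit code 0).
    if run.timed_out or run.exit_code != 0:
        return False, "execution did not complete"
    # Check 2: the run manifested a success signal (assumed: a printed marker).
    if marker not in run.stdout:
        return False, "no success signal"
    # Check 3: the run reached the sink of the target trace.
    if sink not in run.reached_locations:
        return False, "sink not reached"
    return True, "runtime-valid"
```

Returning a reason alongside the verdict keeps a rejected candidate usable as feedback for the next generation round. Passing this filter corresponds only to runtime validity; post-hoc validation against the ground-truth location is a separate step.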

Core claim

By combining static traces with coverage-driven iterative prompting and enforcing that every accepted PoC reaches the ground-truth vulnerable location after a runtime-valid execution, PoC-Gym produces more reliable LLM-generated exploits than earlier single-signal approaches.

What carries the argument

The PoC-Gym pipeline, which augments LLM generation with static traces, coverage feedback, and a three-stage runtime filter that checks execution completeness, success signals, and sink reachability.
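A minimal sketch of coverage-driven iteration, assuming the model and the executor are supplied as plain callables. `ask_model`, `run_candidate`, the feedback wording, and the round budget are all hypothetical; the paper's loop additionally applies the completeness and success-signal checks, which are omitted here for brevity.

```python
def coverage_feedback(trace, reached):
    """Summarise which steps of the static trace a run did not reach."""
    missed = [step for step in trace if step not in reached]
    hit = len(trace) - len(missed)
    return f"coverage {hit}/{len(trace)}; not reached: {', '.join(missed) or 'none'}"


def generate_poc(ask_model, run_candidate, trace, max_rounds=5):
    """Ask for candidates until one reaches the sink or the budget runs out.

    ask_model(feedback) returns a candidate; run_candidate(candidate) returns
    the set of trace steps the candidate reached. Both are hypothetical.
    """
    feedback = "no previous attempt"
    for round_no in range(1, max_rounds + 1):
        candidate = ask_model(feedback)
        reached = run_candidate(candidate)
        # The last step of the trace is treated as the sink.
        if trace[-1] in reached:
            return candidate, round_no
        # Otherwise the coverage summary becomes the next prompt's feedback.
        feedback = coverage_feedback(trace, reached)
    return None, max_rounds
```

The design point is that each failed round narrows the next prompt: the model is told which trace steps were missed rather than only that the attempt failed.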

If this is right

  • Automated PoC generation for Java CVEs becomes feasible for a larger fraction of reported vulnerabilities.
  • Security researchers gain a concrete set of failure modes to target in future validation designs.
  • Multi-stage filters that include static sink reachability can serve as a baseline for comparing new LLM prompting strategies.
  • The gap between runtime-valid and post-hoc-valid outputs quantifies the remaining reliability challenge for LLM exploit generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same static-plus-dynamic validation structure could be tested on other languages once equivalent ground-truth traces are available.
  • Pairing PoC-Gym with directed fuzzers might close part of the runtime-valid to post-hoc-valid gap.
  • The identified failure patterns suggest that future work should add semantic or taint-based checks beyond location reachability.
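To make the last suggestion concrete, here is a toy sketch of a taint-aware check layered on top of location reachability. It is not taken from the paper: the `Tainted` wrapper, the `concat` helper, and the verdict strings are illustrative assumptions, and real taint tracking for Java would operate on bytecode or instrumented runtimes rather than on Python strings.

```python
class Tainted(str):
    """A string marked as attacker-controlled (illustrative wrapper)."""


def concat(left, right):
    """Join two strings; the result is tainted if either input is tainted."""
    joined = str(left) + str(right)
    if isinstance(left, Tainted) or isinstance(right, Tainted):
        return Tainted(joined)
    return joined


def sink_check(value, reached_sink):
    """Combine location reachability with a taint condition at the sink."""
    if not reached_sink:
        return "sink not reached"
    if isinstance(value, Tainted):
        return "sink reached with attacker-controlled data"
    return "sink reached with untainted data"
```

Under this stricter reading, a PoC that arrives at the sink carrying only constant data would be separated from one whose payload actually flows there.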

Load-bearing premise

Reaching the reported vulnerable code location after a runtime-valid run is enough to confirm that the PoC actually triggers the intended vulnerability.

What would settle it

A PoC that passes all runtime checks and arrives at the ground-truth sink yet fails to trigger the reported CVE when executed against an instrumented build that logs the exact vulnerable path.
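One way to picture that experiment is to compare the location log of an instrumented build against the full vulnerable path instead of the sink alone. The sketch below assumes the log is an ordered list of `File.java:line` strings and that the vulnerable path is known step by step; both the log format and the verdict labels are hypothetical.

```python
def path_followed(required_path, observed_log):
    """True if required_path occurs, in order, as a subsequence of observed_log."""
    events = iter(observed_log)
    # "step in events" consumes the iterator, so the order of steps is enforced.
    return all(step in events for step in required_path)


def classify(required_path, observed_log):
    """Separate sink-only hits from runs that follow the whole logged path."""
    sink = required_path[-1]
    if sink not in observed_log:
        return "sink not reached"
    if path_followed(required_path, observed_log):
        return "sink reached along the logged vulnerable path"
    return "sink reached, vulnerable path not followed"
```

A run classified as "sink reached, vulnerable path not followed" is the kind of candidate counterexample described above. Following the logged path is itself still evidence rather than proof that the CVE was triggered.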

Figures

Figures reproduced from arXiv: 2602.04165 by Amartya Das, Claire Wang, Derin Gezgin, Nevena Stojkovic, Shinhae Kim, Zhengdong Huang.

Figure 1. Overview of the PoC-Gym pipeline, which consists of three main stages: prompt construction, PoC generation, and PoC validation with feedback.
Figure 2. Distribution of the manual analysis results for the multi-trace runs. The plain run results…
Figure 3. Detailed results of the manual analysis pipeline for the no-trace experiment results.
Abstract

Recently, Large Language Models (LLMs) have been used in security-related tasks, including generating proof-of-concept (PoC) exploits. Several LLM-assisted approaches have been proposed; they typically generate PoCs from vulnerability descriptions and use additional guidance. However, such approaches are often ineffective because the signals they use for validation (such as printed markers, generated files, or runtime side effects) may not imply that the vulnerability is triggered. Research on more reliable PoC generation is needed but remains challenging. We propose PoC-Gym, a pipeline for LLM-based PoC generation for Java security vulnerabilities. PoC-Gym uses both static and dynamic information, e.g., CVE-tailored prompts, static traces, and coverage-based feedback, and iteratively generates PoC candidates. Each candidate goes through a series of validations: whether the execution is complete, whether it manifests a success signal, and whether it reaches the sink of the target trace. We evaluate PoC-Gym using 20 Java CVEs. Across 338 runs, 116 candidates pass PoC-Gym's runtime validation and 65 candidates pass post-hoc validation against the ground-truth vulnerable locations, covering 12 of the 20 CVEs. On the 14-CVE overlap with FaultLine, the strongest PoC-Gym configuration is post-hoc valid for 8 CVEs, while FaultLine reports success for 5 CVEs under its original evaluation criterion. However, given the complexity of PoC generation, PoC-Gym also generates many runtime-valid but post-hoc-invalid PoCs. To better understand how to achieve more reliable PoC generation, we present an in-depth analysis of such PoCs and identify common sources of failures. We believe that our work provides insights for future research.
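The reported counts imply a validation funnel that is easy to recompute. The short sketch below uses only the numbers stated in the abstract; the dictionary keys are illustrative names, and the per-run rates assume that each run contributes at most one counted candidate, which the abstract does not state explicitly.

```python
# Counts as reported in the abstract.
counts = {
    "runs": 338,
    "runtime_valid": 116,
    "post_hoc_valid": 65,
    "cves_total": 20,
    "cves_covered": 12,
}


def funnel(c):
    """Derive the gap and the pass rates implied by the reported counts."""
    return {
        "runtime_valid_but_post_hoc_invalid": c["runtime_valid"] - c["post_hoc_valid"],
        "runtime_valid_per_run": c["runtime_valid"] / c["runs"],
        "post_hoc_valid_per_runtime_valid": c["post_hoc_valid"] / c["runtime_valid"],
        "post_hoc_valid_per_run": c["post_hoc_valid"] / c["runs"],
        "cve_coverage": c["cves_covered"] / c["cves_total"],
    }


for name, value in funnel(counts).items():
    # Integers are printed as counts, ratios as percentages.
    print(name, value if isinstance(value, int) else f"{value:.1%}")
```

The gap is 51 candidates. About 34.3% of runs yield a runtime-valid candidate, about 56.0% of runtime-valid candidates survive post-hoc validation, and about 19.2% of runs end in a post-hoc-valid PoC, with 60% of the CVEs covered.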

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PoC-Gym, a pipeline for LLM-assisted generation of proof-of-concept exploits targeting Java security vulnerabilities. The approach integrates CVE-specific prompts, static traces, and coverage feedback in an iterative process. PoC candidates undergo validation for complete execution, presence of success signals, and reachability to the ground-truth sink location. On 20 Java CVEs with 338 total runs, the pipeline produces 116 runtime-valid candidates and 65 post-hoc valid PoCs that cover 12 CVEs. In a direct comparison on 14 overlapping CVEs, the strongest PoC-Gym configuration achieves post-hoc validity for 8 CVEs compared to FaultLine's 5 under its original criterion. The work also analyzes common failure modes in the generated PoCs.

Significance. Should the validation approach prove sufficient to confirm actual exploit triggering, this work would contribute empirical evidence that structured LLM pipelines with static-dynamic feedback can improve reliability in PoC generation over prior methods. The analysis of runtime-valid but post-hoc-invalid cases provides useful insights into challenges for future research in this area.

major comments (2)
  1. [Evaluation] The post-hoc validation criterion (reaching the ground-truth vulnerable location after runtime-valid execution) is load-bearing for the central claims of 65 valid PoCs and coverage of 12 CVEs. However, this only confirms reachability to the sink and does not verify that the PoC actually triggers the reported vulnerability (e.g., via the correct data flow or state condition rather than an incidental path). The presence of 51 runtime-valid but post-hoc-invalid candidates highlights the risk of overcounting successes.
  2. [Comparison with FaultLine] The comparison on the 14-CVE overlap reports PoC-Gym succeeding on 8 vs. FaultLine's 5, but the success criteria differ (post-hoc reachability vs. FaultLine's original metric). This makes direct performance claims difficult to interpret without a unified evaluation.
minor comments (2)
  1. [Abstract] The abstract mentions '338 runs' but does not specify how many runs per CVE or configuration details.
  2. [Methodology] Details on the exact LLM models used, prompt templates, and coverage feedback mechanism could be expanded for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, clarifying our validation approach and comparison methodology while incorporating revisions to improve transparency.

  1. Referee: [Evaluation] The post-hoc validation criterion (reaching the ground-truth vulnerable location after runtime-valid execution) is load-bearing for the central claims of 65 valid PoCs and coverage of 12 CVEs. However, this only confirms reachability to the sink and does not verify that the PoC actually triggers the reported vulnerability (e.g., via the correct data flow or state condition rather than an incidental path). The presence of 51 runtime-valid but post-hoc-invalid candidates highlights the risk of overcounting successes.

    Authors: We agree that sink reachability after runtime-valid execution serves as a proxy rather than definitive proof of vulnerability triggering via the precise data flow or state. This proxy was chosen because it leverages the known ground-truth vulnerable location from the CVE trace, providing a more reliable signal than the marker-based or side-effect validations used in prior LLM-assisted PoC work. The 51 runtime-valid but post-hoc-invalid cases are explicitly analyzed in the manuscript to surface common failure modes (e.g., incorrect control flow or missing preconditions), which we believe offers value to the community. In the revised manuscript we have expanded the discussion in Sections 4.3 and 5 to explicitly state this limitation, note that full exploit confirmation would require additional dynamic analyses such as taint tracking, and qualify the 12-CVE coverage claim accordingly. revision: yes

  2. Referee: [Comparison with FaultLine] The comparison on the 14-CVE overlap reports PoC-Gym succeeding on 8 vs. FaultLine's 5, but the success criteria differ (post-hoc reachability vs. FaultLine's original metric). This makes direct performance claims difficult to interpret without a unified evaluation.

    Authors: We acknowledge that the differing criteria (post-hoc sink reachability for PoC-Gym versus FaultLine's original success metric) limit the strength of direct performance claims. The manuscript already reports FaultLine results strictly under its published criterion to avoid misrepresenting prior work. In the revision we have added a clarifying paragraph in the evaluation section that explicitly contrasts the two metrics, cautions against over-interpreting the 8-vs-5 numbers, and explains why a unified re-evaluation was not performed (lack of public implementation details for FaultLine). We maintain that presenting each system under its native criterion remains the most transparent approach given the current state of the field. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical counts from CVE runs

The paper's central claims consist of empirical counts (65 post-hoc valid PoCs covering 12 of 20 CVEs, 8/14 vs FaultLine) obtained by executing the described pipeline on public Java CVEs and applying the stated validation steps (runtime completion, success signal, sink reachability). No equations, fitted parameters, or derivations are present that reduce any result to its inputs by construction. Self-citations, if any, are not load-bearing for a uniqueness theorem or ansatz; the evaluation is externally falsifiable against the CVE corpus and does not rename known results or smuggle assumptions via prior work. This is the standard case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes that CVE descriptions plus static traces are sufficient context for an LLM to produce candidate PoCs and that post-hoc location matching is a reliable proxy for exploit success. No new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption: LLM outputs can be iteratively refined using coverage and trace feedback to reach a target sink
    Invoked when describing the iterative generation loop in the abstract.
  • domain assumption: Reaching the ground-truth vulnerable location after runtime validation implies the PoC triggers the reported vulnerability
    Central to the post-hoc validation claim.

pith-pipeline@v0.9.0 · 5638 in / 1352 out tokens · 30883 ms · 2026-05-16T08:01:56.892069+00:00 · methodology
