From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection
Pith reviewed 2026-05-15 22:18 UTC · model grok-4.3
The pith
On-policy RL with GRPO outperforms SFT and off-policy methods for LLM-based vulnerability detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization methods, and specialized VD LLMs. For data curation, SFT with rejection sampling works better than rationalization-based supervision because the latter introduces hallucinations; in RL, difficulty-aware filtering reduces coverage due to the skewed vulnerability difficulty distribution and harms curriculum learning, while pair-based scheduling helps. Increasing SFT epochs benefits off-policy methods, but excessive SFT suppresses self-exploration in on-policy RL. Binary correctness rewards invite reward hacking, whereas fine-grained root-cause judgments and specification-based rewards improve credit assignment and efficiency. LLM-as-a-Judge evaluation based on root-cause analysis offers a more robust protocol, albeit with variability across judge models.
What carries the argument
On-policy RL with GRPO, which optimizes the model through self-generated trajectories and relative comparisons within groups of outputs to improve vulnerability detection accuracy.
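The group-relative comparison at the heart of GRPO can be sketched in a few lines. This is a generic illustration of the advantage computation, assuming scalar per-response rewards, not the paper's implementation:

```python
import statistics

def grpo_advantages(group_rewards):
    """Compute group-relative advantages: each sampled response is scored
    against the mean and standard deviation of its own rollout group, so
    no separate value network is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in group_rewards]

# Four rollouts for one detection prompt, rewarded 1.0 for a correct
# verdict and 0.0 otherwise: correct responses get positive advantage.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Because advantages are normalized within each group, a prompt where every rollout succeeds (or every rollout fails) contributes no gradient signal, which is one reason data coverage matters for on-policy training.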
If this is right
- Rejection sampling for SFT data is more effective than rationalization because it avoids hallucinations in vulnerability explanations.
- Pair-based data scheduling outperforms difficulty-aware filtering because the latter shrinks coverage in the skewed vulnerability difficulty distribution.
- Moderate SFT epochs improve later off-policy optimization while excessive SFT reduces gains from on-policy RL self-exploration.
- Fine-grained root-cause rewards prevent hacking better than binary classification signals and support more reliable credit assignment.
- Specification-based rewards increase training efficiency at the cost of extra specification design effort.
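As one concrete reading of the rejection-sampling point above, the curation loop reduces to: sample several teacher analyses per prompt and keep only those whose final verdict agrees with the label. A minimal sketch, with `sample_fn` standing in for a hypothetical teacher-LLM call:

```python
def rejection_sample(prompt, true_verdict, sample_fn, n=8):
    """Keep only teacher samples whose final verdict ('HAS_VUL'/'NO_VUL',
    as in the paper's output format) matches the ground-truth label.
    sample_fn(prompt) -> (analysis_text, verdict) is a hypothetical
    stand-in for a teacher-LLM call."""
    kept = []
    for _ in range(n):
        analysis, verdict = sample_fn(prompt)
        if verdict == true_verdict:
            kept.append(analysis)
    return kept

# Toy deterministic "teacher" for illustration only.
canned = iter([("missing bounds check before memcpy", "HAS_VUL"),
               ("length is validated; code is safe", "NO_VUL"),
               ("off-by-one in loop condition", "HAS_VUL")])
accepted = rejection_sample("int f(...) { ... }", "HAS_VUL",
                            lambda _prompt: next(canned), n=3)
print(len(accepted))  # 2: the two HAS_VUL analyses survive filtering
```

Unlike rationalization, which conditions the teacher on the answer key and can produce confident but fabricated explanations, this filter only trusts analyses the teacher reached on its own.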
Where Pith is reading between the lines
- The same post-training patterns may transfer to other code-related security tasks such as malware analysis or exploit generation.
- Automating the generation of specifications could reduce the design cost of specification-based rewards and make them more scalable.
- Direct comparisons against traditional static analysis tools on the same codebases would clarify whether the LLM gains translate into end-to-end security improvements.
Load-bearing premise
The experiments isolate the effects of each post-training method using matched data, model scale, and evaluation protocols so performance differences reflect the training approach itself.
What would settle it
Re-running the full pipeline on the same base models and vulnerability datasets but finding that SFT or off-policy methods match or exceed GRPO performance when data selection and evaluation are held exactly constant.
Figures
read the original abstract
The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, demonstrating that on-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization methods, and specialized VD LLMs. Our study further reveals VD-specific post-training guidelines and insights beyond common practices: (1) For data curation, contrary to the widespread use of rationalization-based supervision in prior VD work, SFT based on rejection sampling proves more effective, as rationalization can introduce hallucinations; in RL training, the inherently skewed difficulty distribution of vulnerabilities leads difficulty-aware data filtering to drastically reduce data coverage, causing non-negligible performance loss, and undermines curriculum learning, while pair-based data scheduling can partially mitigate this. (2) For stage interactions, unlike preference optimization typically applied to lightly trained SFT models, increasing SFT epochs consistently benefits off-policy preference optimization in VD tasks; however, excessive SFT suppresses self-exploration in on-policy RL, limiting its gains. (3) For reward mechanisms, naively treating vulnerability classification correctness as reward signals leads to reward hacking, whereas fine-grained root-cause judgments provide more reliable credit assignment; specification-based rewards further improve efficiency at the cost of additional design and generation effort. (4) For evaluation protocols, LLM-as-a-Judge based on root-cause analysis offers a more robust alternative, albeit with variability across judge models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper conducts the first systematic study of the post-training pipeline for LLM-based vulnerability detection. It claims that on-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization, and existing specialized VD LLMs, while deriving VD-specific insights across data curation, stage interactions, reward design, and evaluation: rejection sampling outperforms rationalization for SFT data curation; difficulty-aware filtering harms coverage and curriculum learning in RL; SFT epoch count interacts differently with off-policy vs. on-policy stages; fine-grained root-cause rewards avoid reward hacking better than binary classification rewards; and LLM-as-a-Judge with root-cause analysis is more robust than standard protocols.
Significance. If the empirical isolation of GRPO effects holds, the work supplies actionable guidelines for post-training LLMs on security tasks and identifies concrete failure modes (reward hacking, data-coverage loss) that are likely to recur in other code-analysis domains. The explicit contrast between rationalization and rejection sampling, plus the stage-interaction findings, would be useful to practitioners even if the absolute gains are modest.
major comments (3)
- [§4, §5.1] §4 (Experimental Setup) and §5.1 (Stage Interactions): the central claim that GRPO's gains are attributable to the on-policy RL stage itself requires explicit confirmation that all compared methods (SFT, DPO, etc.) used identical base models, identical training-data distributions after filtering, and identical evaluation protocols. The abstract itself flags method-dependent choices (SFT epoch counts, LLM-as-Judge model, difficulty filtering) that could confound attribution; without matched ablations or a table showing these controls, the outperformance cannot be isolated from data-curation differences.
- [§4.2] §4.2 (Data Curation) and Table 2 (if present): the claim that rationalization introduces hallucinations that degrade SFT is load-bearing for the rejection-sampling recommendation. The manuscript must report quantitative hallucination rates (e.g., via manual audit or LLM-judge agreement on root-cause fidelity) and show that the performance gap survives when rationalization data is cleaned or when difficulty distributions are matched.
- [§5.3] §5.3 (Reward Mechanisms): the assertion that binary classification rewards cause reward hacking while root-cause judgments do not needs a concrete metric (e.g., reward-model accuracy on held-out root-cause labels or divergence between predicted and true vulnerability locations). Without this, the “more reliable credit assignment” claim remains qualitative.
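To make the reward-hacking contrast in the last comment concrete: a verdict-only reward is maximized by always answering HAS_VUL on a vulnerable-heavy set, while a root-cause reward also demands correct localization. The rubric below is an illustrative construction, not the paper's reward definition:

```python
def binary_reward(pred_verdict, true_verdict):
    # Verdict-only signal: a degenerate policy that flags every sample
    # as HAS_VUL earns full reward on all vulnerable examples.
    return 1.0 if pred_verdict == true_verdict else 0.0

def root_cause_reward(pred_verdict, pred_lines, true_verdict, true_lines):
    # Fine-grained signal: the verdict must be right AND the cited lines
    # must overlap the ground-truth vulnerable lines (hypothetical rubric).
    if pred_verdict != true_verdict:
        return 0.0
    if true_verdict == "NO_VUL":
        return 1.0
    overlap = len(set(pred_lines) & set(true_lines))
    return 0.5 + 0.5 * overlap / len(true_lines)

# An evidence-free "always vulnerable" guess hacks the binary reward
# but is only partially credited under the root-cause reward.
print(binary_reward("HAS_VUL", "HAS_VUL"))               # 1.0
print(root_cause_reward("HAS_VUL", [], "HAS_VUL", [42]))  # 0.5
```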
minor comments (2)
- [Abstract] The abstract lists four numbered insights but the corresponding sections in the full text use inconsistent subsection numbering; align the numbering for readability.
- [Figures] Figure captions for training curves should explicitly state the number of random seeds and whether shaded regions represent standard deviation or standard error.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to include additional controls, metrics, and clarifications to strengthen the attribution of results.
read point-by-point responses
-
Referee: [§4, §5.1] §4 (Experimental Setup) and §5.1 (Stage Interactions): the central claim that GRPO's gains are attributable to the on-policy RL stage itself requires explicit confirmation that all compared methods (SFT, DPO, etc.) used identical base models, identical training-data distributions after filtering, and identical evaluation protocols. The abstract itself flags method-dependent choices (SFT epoch counts, LLM-as-Judge model, difficulty filtering) that could confound attribution; without matched ablations or a table showing these controls, the outperformance cannot be isolated from data-curation differences.
Authors: We agree that explicit controls are necessary to isolate the effects of the on-policy RL stage. In the revised manuscript, we will include a dedicated table that explicitly lists the base model, training data distribution post-filtering, SFT epoch counts, and evaluation protocols for each method. All methods were initialized from the same base model and used the same underlying vulnerability dataset before any method-specific filtering. The method-dependent choices were applied consistently within each pipeline, and we will clarify this to allow verification of the attribution. revision: yes
-
Referee: [§4.2] §4.2 (Data Curation) and Table 2 (if present): the claim that rationalization introduces hallucinations that degrade SFT is load-bearing for the rejection-sampling recommendation. The manuscript must report quantitative hallucination rates (e.g., via manual audit or LLM-judge agreement on root-cause fidelity) and show that the performance gap survives when rationalization data is cleaned or when difficulty distributions are matched.
Authors: We acknowledge the importance of quantifying the hallucination issue. For the revision, we will conduct a manual audit on a sample of rationalized examples to report the hallucination rate. Additionally, we will add an ablation where we clean the rationalization data by removing hallucinated cases and re-evaluate the performance gap to confirm it persists. This will strengthen the recommendation for rejection sampling. revision: yes
-
Referee: [§5.3] §5.3 (Reward Mechanisms): the assertion that binary classification rewards cause reward hacking while root-cause judgments do not needs a concrete metric (e.g., reward-model accuracy on held-out root-cause labels or divergence between predicted and true vulnerability locations). Without this, the “more reliable credit assignment” claim remains qualitative.
Authors: We agree that a concrete metric would make the claim more rigorous. In the revised version, we will report the accuracy of the root-cause reward model on a held-out set of labeled root-cause examples. Additionally, we will measure the divergence between the predicted vulnerability locations in the generated responses and the ground-truth locations from the dataset. This will provide quantitative evidence that root-cause rewards lead to better credit assignment without hacking. revision: yes
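One candidate for the divergence measure mentioned in this response is a Jaccard-style distance between cited and ground-truth line sets; this is a suggested instantiation, not a metric confirmed by the manuscript:

```python
def location_divergence(pred_lines, true_lines):
    """1 - Jaccard overlap between predicted and ground-truth vulnerable
    line numbers: 0.0 is perfect localization, 1.0 is no overlap
    (illustrative metric, not the authors' definition)."""
    p, t = set(pred_lines), set(true_lines)
    if not p and not t:
        return 0.0  # both empty: trivially agree
    return 1.0 - len(p & t) / len(p | t)

# Partial overlap: one shared line out of three cited/true lines.
print(round(location_divergence([10, 11], [11, 12]), 3))  # 0.667
```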
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential reductions
full rationale
The paper is an empirical study reporting experimental outcomes from SFT, preference optimization, and on-policy RL (GRPO) on vulnerability detection tasks. No equations, first-principles derivations, or predictions are claimed; performance differences are attributed to reported training runs, data curation choices, and evaluation protocols. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatz smuggling appear. The central claims rest on external experimental results rather than internal definitions, satisfying the self-contained benchmark criterion.