pith. machine review for the scientific record.

arxiv: 2602.14012 · v2 · submitted 2026-02-15 · 💻 cs.CR · cs.AI · cs.SE

Recognition: no theorem link

From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 22:18 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SE
keywords vulnerability detection · LLM post-training · supervised fine-tuning · reinforcement learning · GRPO · reward design · software security

The pith

On-policy RL with GRPO outperforms SFT and off-policy methods for LLM-based vulnerability detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the full post-training pipeline for adapting large language models to detect software vulnerabilities. It demonstrates that on-policy reinforcement learning with GRPO produces stronger results than supervised fine-tuning, off-policy preference optimization, or existing specialized vulnerability detection models. The work also extracts concrete guidelines on data curation, stage interactions, reward design, and evaluation that depart from standard practice in other domains. These guidelines address issues such as hallucinations introduced by rationalization-based supervision, the skewed difficulty distribution of vulnerability data, reward hacking induced by bare correctness signals, and the value of root-cause analysis in judging.

Core claim

On-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization methods, and specialized VD LLMs. For data curation, SFT with rejection sampling works better than rationalization-based supervision because the latter introduces hallucinations; in RL, difficulty-aware filtering reduces coverage due to the skewed difficulty distribution of vulnerabilities and undermines curriculum learning, while pair-based scheduling helps. Increasing SFT epochs benefits off-policy methods, but excessive SFT suppresses self-exploration in on-policy RL. Binary correctness rewards invite hacking, whereas fine-grained root-cause judgments and specification-based rewards improve credit assignment and efficiency. For evaluation, LLM-as-a-Judge based on root-cause analysis offers a more robust protocol, albeit with variability across judge models.

What carries the argument

On-policy RL with GRPO, which optimizes the model through self-generated trajectories and relative comparisons within groups of outputs to improve vulnerability detection accuracy.
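
A minimal sketch of the group-relative advantage at the heart of GRPO, assuming one scalar reward per sampled response; the function name and the 0/1 reward values are illustrative, not taken from the paper.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages for one prompt.

    GRPO samples a group of G responses per query, scores each with the
    reward function, and normalizes rewards within the group, so no
    separate value (critic) network is needed.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 8 rollouts for one vulnerability-detection query, rewarded 0/1.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct rollouts receive positive advantage
```

Each rollout's token log-probabilities are then weighted by its advantage in a clipped, PPO-style policy-gradient update, which is what makes the optimization on-policy and driven by self-generated trajectories.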

If this is right

  • Rejection sampling for SFT data is more effective than rationalization because it avoids hallucinations in vulnerability explanations.
  • Pair-based data scheduling outperforms difficulty-aware filtering because the latter shrinks coverage under the skewed vulnerability difficulty distribution (a minimal sketch of such pairing appears after this list).
  • Moderate SFT epochs improve later off-policy optimization while excessive SFT reduces gains from on-policy RL self-exploration.
  • Fine-grained root-cause rewards prevent hacking better than binary classification signals and support more reliable credit assignment.
  • Specification-based rewards increase training efficiency at the cost of extra specification design effort.
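
A hedged sketch of the pair-based scheduling referenced above: each vulnerable function travels through training alongside its patched counterpart, which keeps labels balanced and batch difficulty stable without filtering out hard CWEs. The data layout and names are assumptions for illustration, not the paper's implementation.

```python
import random

def pair_based_batches(pairs, batch_pairs=8, seed=0):
    """Yield batches that keep each (vulnerable, patched) pair together.

    `pairs` is a list of (vulnerable_sample, patched_sample) tuples.
    Shuffling pairs rather than individual samples guarantees a 50/50
    label split per batch and lowers batch-difficulty variance.
    """
    rng = random.Random(seed)
    order = list(pairs)
    rng.shuffle(order)
    for i in range(0, len(order), batch_pairs):
        chunk = order[i:i + batch_pairs]
        # Flatten: 2 * batch_pairs samples, half vulnerable, half patched.
        yield [s for vuln, patched in chunk for s in (vuln, patched)]
```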

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same post-training patterns may transfer to other code-related security tasks such as malware analysis or exploit generation.
  • Automating the generation of specifications could reduce the design cost of specification-based rewards and make them more scalable.
  • Direct comparisons against traditional static analysis tools on the same codebases would clarify whether the LLM gains translate into end-to-end security improvements.

Load-bearing premise

The experiments isolate the effects of each post-training method using matched data, model scale, and evaluation protocols so performance differences reflect the training approach itself.

What would settle it

Re-running the full pipeline on the same base models and vulnerability datasets but finding that SFT or off-policy methods match or exceed GRPO performance when data selection and evaluation are held exactly constant.

Figures

Figures reproduced from arXiv: 2602.14012 by Fuxun Yu, Xinda Wang, Youpeng Li.

Figure 1. Overview of post-training pipelines for LLM-based VD: cold-start SFT (Section 3.1), off-policy preference optimization (Section 3.2), and on-policy RL (Section 3.3).
Figure 2. Training Pipeline for SFT.
Figure 3. Training Pipeline for Preference Optimization.
Figure 4. Training Pipeline for Reinforcement Learning.
Figure 5. Difficulty Distribution of 𝒟_diff on the SFT LLM.
Figure 6. Rubric for Specification-based Reward Assignment.
Figure 7. Comparison of SFT Data Curation Methods.
Figure 8. Impact of SFT on Preference Optimization.
Figure 9. Impact of SFT on On-Policy RL.
Figure 10. CVE-2024-46769 - NULL Pointer Dereference.
Figure 11. Impact of Reward Granularities on RL Training.
Figure 12. Model Assessment Shifts Across Multi-Level Evaluation Granularities.
Figure 13. Distribution of Dataset.
Figure 14. Example of Prediction-Level Evaluation Logic for CWE-416 within the CWE-1000 Taxonomy.
Figure 15. Prompt for LLM-based VD.
Figure 16. Prompt for Rationalization-based Data Curation.
Figure 17. Reasoning-level Judge Prompt for Vulnerable Samples.
Figure 18. Reasoning-level Judge Prompt for Patched Samples.
Figure 19. Specification Generation Prompt for Vulnerable Samples.
Figure 20. Specification Generation Prompt for Patched Samples.
Figure 21. Specification-level Judge Prompt for Vulnerable Samples.
Figure 22. Specification-level Judge Prompt for Patched Samples.
original abstract

The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, demonstrating that on-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization methods, and specialized VD LLMs. Our study further reveals VD-specific post-training guidelines and insights beyond common practices: (1) For data curation, contrary to the widespread use of rationalization-based supervision in prior VD work, SFT based on rejection sampling proves more effective, as rationalization can introduce hallucinations; in RL training, the inherently skewed difficulty distribution of vulnerabilities leads difficulty-aware data filtering to drastically reduce data coverage, causing non-negligible performance loss, and undermines curriculum learning, while pair-based data scheduling can partially mitigate this. (2) For stage interactions, unlike preference optimization typically applied to lightly trained SFT models, increasing SFT epochs consistently benefits off-policy preference optimization in VD tasks; however, excessive SFT suppresses self-exploration in on-policy RL, limiting its gains. (3) For reward mechanisms, naively treating vulnerability classification correctness as reward signals leads to reward hacking, whereas fine-grained root-cause judgments provide more reliable credit assignment; specification-based rewards further improve efficiency at the cost of additional design and generation effort. (4) For evaluation protocols, LLM-as-a-Judge based on root-cause analysis offers a more robust alternative, albeit with variability across judge models.
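
To make insight (3) concrete, here is a minimal sketch of how a reward might be densified across the granularities the abstract names: binary verdict, hierarchical CWE match (cf. Figure 14), and an LLM-judged root-cause score. The weights, attribute names, and judge interface are assumptions for illustration; the paper's Figure 6 rubric defines the actual specification-based scheme.

```python
def cwe_match(pred_cwe, true_cwe, ancestors):
    """Hierarchical CWE match: accept the prediction if it equals the label
    or lies on the same ancestor chain in the CWE-1000 taxonomy."""
    return (pred_cwe == true_cwe
            or pred_cwe in ancestors.get(true_cwe, set())
            or true_cwe in ancestors.get(pred_cwe, set()))

def vd_reward(pred, label, ancestors, judge_root_cause):
    """Blend detection, CWE, and root-cause signals into one scalar reward.

    A bare 0/1 detection reward is the signal the paper reports as hackable;
    the graded terms below illustrate one hypothetical way to densify it.
    """
    r = 0.0
    if pred.verdict == label.verdict:          # binary detection correctness
        r += 0.2
        if label.verdict == "HAS_VUL":
            if cwe_match(pred.cwe, label.cwe, ancestors):
                r += 0.3                        # CWE-prediction credit
            r += 0.5 * judge_root_cause(pred.analysis, label)  # score in [0, 1]
    return r
```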

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper conducts the first systematic study of the post-training pipeline for LLM-based vulnerability detection. It claims that on-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization, and existing specialized VD LLMs, and derives VD-specific insights along four axes: data curation (rejection sampling outperforms rationalization for SFT; difficulty-aware filtering harms coverage and curriculum learning in RL), stage interactions (SFT epoch count affects off-policy and on-policy stages differently), reward design (fine-grained root-cause rewards resist hacking better than binary classification rewards), and evaluation (LLM-as-a-Judge with root-cause analysis is more robust than standard protocols).

Significance. If the empirical isolation of GRPO effects holds, the work supplies actionable guidelines for post-training LLMs on security tasks and identifies concrete failure modes (reward hacking, data-coverage loss) that are likely to recur in other code-analysis domains. The explicit contrast between rationalization and rejection sampling, plus the stage-interaction findings, would be useful to practitioners even if the absolute gains are modest.

major comments (3)
  1. [§4, §5.1] §4 (Experimental Setup) and §5.1 (Stage Interactions): the central claim that GRPO's gains are attributable to the on-policy RL stage itself requires explicit confirmation that all compared methods (SFT, DPO, etc.) used identical base models, identical training-data distributions after filtering, and identical evaluation protocols. The abstract itself flags method-dependent choices (SFT epoch counts, LLM-as-Judge model, difficulty filtering) that could confound attribution; without matched ablations or a table showing these controls, the outperformance cannot be isolated from data-curation differences.
  2. [§4.2] §4.2 (Data Curation) and Table 2 (if present): the claim that rationalization introduces hallucinations that degrade SFT is load-bearing for the rejection-sampling recommendation. The manuscript must report quantitative hallucination rates (e.g., via manual audit or LLM-judge agreement on root-cause fidelity) and show that the performance gap survives when rationalization data is cleaned or when difficulty distributions are matched.
  3. [§5.3] §5.3 (Reward Mechanisms): the assertion that binary classification rewards cause reward hacking while root-cause judgments do not needs a concrete metric (e.g., reward-model accuracy on held-out root-cause labels or divergence between predicted and true vulnerability locations). Without this, the “more reliable credit assignment” claim remains qualitative.
minor comments (2)
  1. [Abstract] The abstract lists four numbered insights but the corresponding sections in the full text use inconsistent subsection numbering; align the numbering for readability.
  2. [Figures] Figure captions for training curves should explicitly state the number of random seeds and whether shaded regions represent standard deviation or standard error.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to include additional controls, metrics, and clarifications to strengthen the attribution of results.

point-by-point responses
  1. Referee: [§4, §5.1] §4 (Experimental Setup) and §5.1 (Stage Interactions): the central claim that GRPO's gains are attributable to the on-policy RL stage itself requires explicit confirmation that all compared methods (SFT, DPO, etc.) used identical base models, identical training-data distributions after filtering, and identical evaluation protocols. The abstract itself flags method-dependent choices (SFT epoch counts, LLM-as-Judge model, difficulty filtering) that could confound attribution; without matched ablations or a table showing these controls, the outperformance cannot be isolated from data-curation differences.

    Authors: We agree that explicit controls are necessary to isolate the effects of the on-policy RL stage. In the revised manuscript, we will include a dedicated table that explicitly lists the base model, training data distribution post-filtering, SFT epoch counts, and evaluation protocols for each method. All methods were initialized from the same base model and used the same underlying vulnerability dataset before any method-specific filtering. The method-dependent choices were applied consistently within each pipeline, and we will clarify this to allow verification of the attribution. revision: yes

  2. Referee: [§4.2] §4.2 (Data Curation) and Table 2 (if present): the claim that rationalization introduces hallucinations that degrade SFT is load-bearing for the rejection-sampling recommendation. The manuscript must report quantitative hallucination rates (e.g., via manual audit or LLM-judge agreement on root-cause fidelity) and show that the performance gap survives when rationalization data is cleaned or when difficulty distributions are matched.

    Authors: We acknowledge the importance of quantifying the hallucination issue. For the revision, we will conduct a manual audit on a sample of rationalized examples to report the hallucination rate. Additionally, we will add an ablation where we clean the rationalization data by removing hallucinated cases and re-evaluate the performance gap to confirm it persists. This will strengthen the recommendation for rejection sampling. revision: yes

  3. Referee: [§5.3] §5.3 (Reward Mechanisms): the assertion that binary classification rewards cause reward hacking while root-cause judgments do not needs a concrete metric (e.g., reward-model accuracy on held-out root-cause labels or divergence between predicted and true vulnerability locations). Without this, the “more reliable credit assignment” claim remains qualitative.

    Authors: We agree that a concrete metric would make the claim more rigorous. In the revised version, we will report the accuracy of the root-cause reward model on a held-out set of labeled root-cause examples. Additionally, we will measure the divergence between the predicted vulnerability locations in the generated responses and the ground-truth locations from the dataset. This will provide quantitative evidence that root-cause rewards lead to better credit assignment without hacking. revision: yes
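
A sketch of the kind of quantitative check promised here: score the same completions with the binary reward and with a root-cause judge, then report how often the binary signal grants full reward to completions the judge rejects, a direct proxy for reward hacking. All names and thresholds are illustrative assumptions.

```python
def reward_hacking_rate(completions, binary_reward, judge_score, thresh=0.5):
    """Fraction of binary-rewarded completions that fail the root-cause judge:
    the verdict is right, but the reasoning behind it is wrong or hallucinated."""
    rewarded = [c for c in completions if binary_reward(c) == 1.0]
    hacked = [c for c in rewarded if judge_score(c) < thresh]
    return len(hacked) / max(len(rewarded), 1)
```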

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

full rationale

The paper is an empirical study reporting experimental outcomes from SFT, preference optimization, and on-policy RL (GRPO) on vulnerability detection tasks. No equations, first-principles derivations, or predictions are claimed; performance differences are attributed to reported training runs, data curation choices, and evaluation protocols. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatz smuggling appear. The central claims rest on external experimental results rather than internal definitions, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning study; no mathematical axioms, free parameters, or invented entities are introduced beyond standard training practices.

pith-pipeline@v0.9.0 · 5603 in / 1072 out tokens · 21204 ms · 2026-05-15T22:18:18.417669+00:00 · methodology

discussion (0)

