Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs
Pith reviewed 2026-05-16 09:20 UTC · model grok-4.3 · Recognition: 1 Lean theorem link
The pith
Reinforcement learning guided by deterministic verifiers raises LLM accuracy on cyber threat intelligence tasks by 15.8 points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MinervaRL is a lightweight self-training mechanism that uses task-specific verifiers to generate additional verified trajectories and distills them back into the model; averaged across four backbones and twelve CTI benchmarks, this yields a mean improvement of 15.8 percentage points over the corresponding base models and 4.3 points over GRPO.
What carries the argument
MinervaRL, the lightweight self-training mechanism that generates and distills verified trajectories to address reward sparsity during reinforcement learning with verifiable rewards.
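The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `sample_fn`, the toy "policy", and the exact-match verifier are all hypothetical stand-ins. The idea it shows is the one the paper names: when all k base rollouts for a prompt earn zero verifier reward (reward sparsity), keep sampling until a verified trajectory appears, and collect that trajectory for supervised distillation back into the model.

```python
import random

def self_train_verified(sample_fn, verifier, prompts, k=4, max_attempts=50, seed=0):
    """Hypothetical MinervaRL-style loop: for prompts whose k base rollouts
    all receive zero verifier reward, search for a verified trajectory and
    buffer it for distillation."""
    rng = random.Random(seed)
    buffer = []
    for x in prompts:
        rollouts = [sample_fn(x, rng) for _ in range(k)]
        if any(verifier(x, y) for y in rollouts):
            continue  # a reward signal already exists; plain RLVR can use it
        for _ in range(max_attempts):
            y = sample_fn(x, rng)
            if verifier(x, y):
                buffer.append((x, y))  # verified trajectory to distill back
                break
    return buffer

# Toy setup (assumed, not from the paper): the "policy" almost always emits
# the canonical ATT&CK ID for prompt "a" but rarely does for prompt "b".
gold = {"a": "T1059", "b": "T1566"}

def sample_fn(x, rng):
    return gold[x] if rng.random() < (0.9 if x == "a" else 0.05) else "T0000"

verifier = lambda x, y: y == gold[x]

buffer = self_train_verified(sample_fn, verifier, ["a", "b"])
```

Under this toy setup the buffer only ever receives trajectories the verifier accepts, which is the invariant that makes distilling them back a supervised update on verified data.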
If this is right
- Structured CTI outputs become more reliable under RLVR than under supervised fine-tuning alone.
- Canonical identifiers and schemas in CTI resources enable scalable, deterministic reward signals without large new annotation efforts.
- Performance gains hold across multiple LLM backbones when the same verifier pipeline is applied.
- Self-training on verified trajectories reduces the impact of reward sparsity in rollout phases.
Where Pith is reading between the lines
- The same verifier-driven RL approach could transfer to other domains that maintain canonical schemas, such as medical coding or legal document structuring.
- As CTI standards evolve, the method may lower the cost of retraining models by relying on updated verifiers rather than full supervised datasets.
- Extending verifier coverage to rare edge cases in threat artifacts would likely produce additional measurable gains beyond the reported averages.
Load-bearing premise
The task-specific verifiers must produce accurate, unbiased, and sufficiently dense rewards without systematic errors or coverage gaps that could mislead the reinforcement learning process.
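To make this premise concrete, here is a minimal sketch of what a deterministic, schema-based verifier could look like for one task family (ATT&CK technique prediction). The function name and all-or-nothing scoring are assumptions for illustration; the paper only states that verifiers score structured outputs and identifier predictions against canonical schemas.

```python
import re

# Canonical ATT&CK technique schema: Txxxx or Txxxx.yyy (sub-technique).
TECHNIQUE_RE = re.compile(r"T\d{4}(\.\d{3})?")

def technique_reward(prediction: str, gold: str) -> float:
    """Hypothetical deterministic reward: 1.0 only for a schema-valid,
    exactly matching identifier; 0.0 otherwise (no partial credit)."""
    pred = prediction.strip()
    if not TECHNIQUE_RE.fullmatch(pred):
        return 0.0  # schema violation: output is not a canonical identifier
    return 1.0 if pred == gold else 0.0

technique_reward("T1059.001", "T1059.001")  # 1.0
technique_reward("T1059", "T1059.001")      # 0.0: parent technique != sub-technique
technique_reward("run cmd.exe", "T1059")    # 0.0: free text fails the schema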
What would settle it
Run the trained models on a fresh set of CTI artifacts and compare outputs against independent expert annotations or alternative verification tools; failure to match the reported gains or consistent verifier-expert disagreement would falsify the improvement claim.
Original abstract
Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose MinervaRL, a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Averaged across four backbones and 12 CTI benchmarks, MinervaRL improves the mean score by 15.8 percentage points over the corresponding base models and by 4.3 points over GRPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Minerva, a unified dataset and training pipeline for cyber threat intelligence (CTI) subtasks that pairs each task with deterministic, schema-based verifiers to enable reinforcement learning with verifiable rewards (RLVR). It proposes MinervaRL, a lightweight self-training mechanism that generates additional verified trajectories to mitigate reward sparsity and distills them back into the policy. Averaged across four backbones and 12 CTI benchmarks, the method is reported to improve mean scores by 15.8 percentage points over base models and 4.3 points over GRPO.
Significance. If the verifiers prove reliable and the gains hold under scrutiny, the work would provide concrete evidence that RLVR can improve structured output generation in a high-stakes domain by exploiting community-maintained schemas, offering a scalable path beyond supervised fine-tuning for tasks with verifiable outputs.
major comments (2)
- [Abstract] The headline gains (15.8 pp over base models, 4.3 pp over GRPO) are presented without experimental details, ablation studies, verifier definitions, statistical controls, or tables, making the central empirical claim impossible to evaluate from the supplied text.
- [MinervaRL] The claim that distilling verifier-labeled rollouts improves capability rests on the untested assumption that the task-specific verifiers are accurate, unbiased, and free of coverage gaps; no evidence on edge-case handling, partial-match scoring, or human-expert agreement is supplied, and that evidence is load-bearing for ruling out reward hacking.
minor comments (1)
- The abstract refers to 'four backbones and 12 CTI benchmarks' without naming them, which reduces reproducibility and clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of clarity and empirical rigor that we will address in the revision. We provide point-by-point responses below.
Point-by-point responses
Referee: [Abstract] The headline gains (15.8 pp over base models, 4.3 pp over GRPO) are presented without experimental details, ablation studies, verifier definitions, statistical controls, or tables, making the central empirical claim impossible to evaluate from the supplied text.
Authors: We agree that the abstract, constrained by length, presents results at a high level without the supporting details. The full manuscript details the experimental setup (four backbones, 12 benchmarks), verifier definitions, and ablations in Sections 2 and 4, along with the main results table. We will revise the abstract to include a concise reference to the experimental scope (e.g., 'across four backbones and 12 CTI benchmarks') and the core methodological elements, while staying within word limits. This will improve standalone evaluability without duplicating the body of the paper. revision: partial
Referee: [MinervaRL] The claim that distilling verifier-labeled rollouts improves capability rests on the untested assumption that the task-specific verifiers are accurate, unbiased, and free of coverage gaps; no evidence on edge-case handling, partial-match scoring, or human-expert agreement is supplied, and that evidence is load-bearing for ruling out reward hacking.
Authors: We acknowledge the importance of demonstrating verifier reliability to support the RLVR claims and rule out reward hacking. The verifiers are deterministic and directly implement community CTI schemas and standards as described in Section 2.2, with explicit scoring rules for structured outputs and identifiers. However, the manuscript does not include human-expert agreement metrics or dedicated edge-case analysis. We will add a new subsection (in the experiments) reporting a human validation study on a held-out sample of 200 outputs, measuring inter-annotator agreement with expert CTI analysts, and discussing coverage for partial matches and edge cases. This revision will directly address the concern. revision: yes
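The validation study the authors propose amounts to measuring chance-corrected agreement between verifier labels and expert labels on a held-out sample. A minimal sketch of that computation, using plain Cohen's kappa on binary accept/reject labels (the 200-output sample size and binary framing are assumptions from the rebuttal, not reported results):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences, e.g.
    verifier decisions vs. expert CTI-analyst decisions on the same outputs."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement under independent labeling with the same marginals.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

# Toy sample: verifier accepts 3 of 4 outputs, expert accepts 2 of 4.
kappa = cohen_kappa([1, 1, 1, 0], [1, 1, 0, 0])  # 0.5 on this toy data
```

Systematic verifier-expert disagreement on such a sample (low kappa) would be exactly the coverage-gap evidence the referee asks for.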
Circularity Check
No circularity: empirical RL gains rest on external verifiers and benchmarks
full rationale
The paper reports empirical performance gains from training LLMs with task-specific deterministic verifiers built on CTI schemas. No equations, derivations, or self-referential definitions reduce the reported scores to fitted parameters or prior outputs by construction. The MinervaRL self-training step generates new trajectories scored by the same external schema verifiers and distills them back; this is a standard RL loop whose success is measured on held-out benchmarks rather than tautologically. Self-citations, if present, are not load-bearing for the central claim. The reported gains therefore rest on external evaluation rather than circular construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. ... task-specific verifiers that score structured outputs and identifier predictions."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.