pith. machine review for the scientific record.

arxiv: 2602.00513 · v3 · submitted 2026-01-31 · 💻 cs.LG

Recognition: 1 theorem link · Lean Theorem

Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords cyber threat intelligence · reinforcement learning · verifiable rewards · large language models · structured outputs · self-training · MinervaRL · CTI benchmarks

The pith

Reinforcement learning guided by deterministic verifiers raises LLM accuracy on cyber threat intelligence tasks by 15.8 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs improve at converting unstructured security data into standardized CTI formats when trained with reinforcement learning that uses task-specific verifiers derived from community standards and schemas. These verifiers supply deterministic scores for structured outputs and identifier predictions, sidestepping the brittleness of supervised fine-tuning. MinervaRL adds a lightweight self-training loop that creates extra verified trajectories to handle sparse rewards during rollout. Averaged results across four model backbones and twelve benchmarks show gains of 15.8 points over base models and 4.3 points over GRPO, indicating that verifiable reward signals can make automated CTI extraction more reliable.
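
To make the verifier idea concrete, the sketch below shows the kind of deterministic scorer such a pipeline could use: exact match for single-identifier tasks and set F1 for multi-label tasks. The function names, regex patterns, and the choice of set F1 are illustrative assumptions, not the paper's implementation.

    import re

    # Canonical ATT&CK-style identifier patterns (illustrative): techniques look
    # like T1059 or T1059.001, tactics like TA0005.
    TECHNIQUE_ID = re.compile(r"^T\d{4}(?:\.\d{3})?$")
    TACTIC_ID = re.compile(r"^TA\d{4}$")

    def verify_technique(prediction: str, gold: str) -> float:
        """Deterministic reward for single-ID tasks: 1.0 only for a well-formed
        identifier that exactly matches the reference label."""
        pred = prediction.strip()
        if not TECHNIQUE_ID.match(pred):
            return 0.0  # malformed output earns no reward
        return 1.0 if pred == gold.strip() else 0.0

    def verify_tactic_set(predictions: list[str], gold: list[str]) -> float:
        """Deterministic reward for multi-label tasks: set F1 between predicted
        and reference tactic identifiers (one possible scoring rule)."""
        pred = {p.strip() for p in predictions if TACTIC_ID.match(p.strip())}
        ref = {g.strip() for g in gold}
        if not pred or not ref:
            return 0.0
        tp = len(pred & ref)
        if tp == 0:
            return 0.0
        precision, recall = tp / len(pred), tp / len(ref)
        return 2 * precision * recall / (precision + recall)

Because each score is a pure function of the model output and a canonical identifier, the same check can grade rollouts during training and benchmark outputs at evaluation time.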

Core claim

MinervaRL is a self-training mechanism that generates additional verified trajectories from task-specific verifiers and distills them back into the model; this process yields an average improvement of 15.8 percentage points over corresponding base models and 4.3 points over GRPO when evaluated across four backbones and twelve CTI benchmarks.

What carries the argument

MinervaRL, the lightweight self-training mechanism that generates and distills verified trajectories to address reward sparsity during reinforcement learning with verifiable rewards.
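
A minimal sketch of the loop this description implies, assuming a policy object that can sample, take an RL update, and be fine-tuned on traces, plus a verifier like the one sketched above. The retry budget, the acr_prompt and extract_answer helpers, and the buffer-based distillation step are hypothetical placeholders, not the paper's exact procedure.

    def self_training_round(policy, verifier, tasks, rollouts_per_task=8, max_acr_tries=4):
        """One illustrative round: RL on verifier-scored rollouts, plus extra
        verified trajectories for prompts whose rollouts all scored zero."""
        rl_batch, distill_buffer = [], []

        for prompt, gold in tasks:
            # Standard rollout phase: sample candidates, score them deterministically.
            samples = [policy.sample(prompt) for _ in range(rollouts_per_task)]
            rewards = [verifier(out, gold) for out in samples]
            rl_batch.append((prompt, samples, rewards))

            # Sparse-reward case: no rollout succeeded, so elicit an
            # answer-conditioned trace and keep it only if it verifies.
            if max(rewards) == 0.0:
                for _ in range(max_acr_tries):
                    trace = policy.sample(acr_prompt(prompt, gold))   # hypothetical helper
                    if verifier(extract_answer(trace), gold) == 1.0:  # hypothetical helper
                        distill_buffer.append((prompt, trace))
                        break

        policy.rl_update(rl_batch)          # e.g., a GRPO-style group update
        if distill_buffer:
            policy.distill(distill_buffer)  # supervised step on verified traces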

If this is right

  • Structured CTI outputs become more reliable under RLVR than under supervised fine-tuning alone.
  • Canonical identifiers and schemas in CTI resources enable scalable, deterministic reward signals without large new annotation efforts.
  • Performance gains hold across multiple LLM backbones when the same verifier pipeline is applied.
  • Self-training on verified trajectories reduces the impact of reward sparsity in rollout phases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verifier-driven RL approach could transfer to other domains that maintain canonical schemas, such as medical coding or legal document structuring.
  • As CTI standards evolve, the method may lower the cost of retraining models by relying on updated verifiers rather than full supervised datasets.
  • Extending verifier coverage to rare edge cases in threat artifacts would likely produce additional measurable gains beyond the reported averages.

Load-bearing premise

The task-specific verifiers must produce accurate, unbiased, and sufficiently dense rewards without systematic errors or coverage gaps that could mislead the reinforcement learning process.
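
One practical check on this premise is to track how often the verifiers return any signal at all, which is the quantity Figure 4 plots: the fraction of prompts whose sampled rollouts all receive zero reward. A minimal sketch, assuming the same sampler and verifier interface as above.

    def zero_reward_fraction(policy, verifier, tasks, rollouts_per_task=8):
        """Fraction of prompts with no successful rollout (max reward = 0),
        i.e. the reward-sparsity curve tracked in Figure 4."""
        failures = 0
        for prompt, gold in tasks:
            rewards = [verifier(policy.sample(prompt), gold)
                       for _ in range(rollouts_per_task)]
            if max(rewards) == 0.0:
                failures += 1
        return failures / len(tasks)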

What would settle it

Run the trained models on a fresh set of CTI artifacts and compare outputs against independent expert annotations or alternative verification tools; failure to match the reported gains or consistent verifier-expert disagreement would falsify the improvement claim.

Figures

Figures reproduced from arXiv: 2602.00513 by Aritran Piplai, Ionut Cardei, Md Tanvirul Alam, Nidhi Rastogi, Peter J Worth Jr.

Figure 1
Figure 1: MinervaRL overview. view at source ↗
Figure 2
Figure 2: Prompt template used for pairwise response-quality judging (both responses are verifier-correct). view at source ↗
Figure 3
Figure 3: Pairwise preference heatmap from the CTI judge. Each cell (row A, column B) shows the fraction of comparisons where A is preferred over B (ties excluded). Higher values indicate stronger preference for the row model. view at source ↗
Figure 5
Figure 5: Acceptance rates of the ACR self-distillation pipeline over training for Llama-8B-Instruct: fraction of batches passing the heuristic filter, passing the machine-learning (ML) filter, and meeting unique identifier (UID) coverage (i.e., accepted traces span a minimum number of distinct UIDs). view at source ↗
Figure 4
Figure 4: Fraction of training prompts with no successful rollout (max reward = 0) over training steps for Llama-8B-Instruct. Curves are smoothed with a 5-step rolling mean. view at source ↗
Figure 7
Figure 7: Answer-conditioned reasoning (ACR) prompt template used to elicit training traces. view at source ↗
Figure 8
Figure 8: Prompt template used to label reward-correct responses as GOOD/BAD for training the TextCNN reasoning-quality classifier. view at source ↗
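
Figures 5, 7, and 8 together describe how self-distillation traces are vetted: a heuristic filter, a TextCNN-style quality classifier (the paper reports keeping candidates whose "good" probability is at least τq = 0.5), and a check that accepted traces cover enough distinct identifiers. The sketch below is an illustrative rendering of that pipeline; the candidate attributes and filter callables are assumptions, not the paper's code.

    def accept_traces(candidates, heuristic_ok, quality_prob, min_unique_uids, tau_q=0.5):
        """Illustrative ACR acceptance pipeline (cf. Figures 5, 7, 8): heuristic
        filter, ML quality filter at threshold tau_q, then a UID-coverage gate."""
        kept = [c for c in candidates if heuristic_ok(c.text)]     # heuristic filter
        kept = [c for c in kept if quality_prob(c.text) >= tau_q]  # ML (TextCNN-style) filter
        uids = {c.uid for c in kept}                               # distinct identifiers covered
        return kept if len(uids) >= min_unique_uids else []        # UID coverage gate
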
read the original abstract

Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose MinervaRL, a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Averaged across four backbones and 12 CTI benchmarks, MinervaRL improves the mean score by 15.8 percentage points over the corresponding base models and by 4.3 points over GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Minerva, a unified dataset and training pipeline for cyber threat intelligence (CTI) subtasks that pairs each task with deterministic, schema-based verifiers to enable reinforcement learning with verifiable rewards (RLVR). It proposes MinervaRL, a lightweight self-training mechanism that generates additional verified trajectories to mitigate reward sparsity and distills them back into the policy. Averaged across four backbones and 12 CTI benchmarks, the method is reported to improve mean scores by 15.8 percentage points over base models and 4.3 points over GRPO.

Significance. If the verifiers prove reliable and the gains hold under scrutiny, the work would provide concrete evidence that RLVR can improve structured output generation in a high-stakes domain by exploiting community-maintained schemas, offering a scalable path beyond supervised fine-tuning for tasks with verifiable outputs.

major comments (2)
  1. [Abstract] Abstract: the headline gains (15.8 pp over base models, 4.3 pp over GRPO) are presented without any experimental details, ablation studies, verifier definitions, statistical controls, or tables, rendering the central empirical claim impossible to evaluate from the supplied text.
  2. [MinervaRL] MinervaRL self-training description: the claim that distilling verifier-labeled rollouts improves capability rests on the untested assumption that the task-specific verifiers are accurate, unbiased, and free of coverage gaps; no evidence on edge-case handling, partial-match scoring, or human-expert agreement is supplied, which is load-bearing for ruling out reward hacking.
minor comments (1)
  1. The abstract refers to 'four backbones and 12 CTI benchmarks' without naming them, which reduces reproducibility and clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of clarity and empirical rigor that we will address in the revision. We provide point-by-point responses below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline gains (15.8 pp over base models, 4.3 pp over GRPO) are presented without any experimental details, ablation studies, verifier definitions, statistical controls, or tables, rendering the central empirical claim impossible to evaluate from the supplied text.

    Authors: We agree that the abstract, constrained by length, presents results at a high level without the supporting details. The full manuscript details the experimental setup (four backbones, 12 benchmarks), verifier definitions, and ablations in Sections 2 and 4, along with the main results table. We will revise the abstract to include a concise reference to the experimental scope (e.g., 'across four backbones and 12 CTI benchmarks') and the core methodological elements, while staying within word limits. This will improve standalone evaluability without duplicating the body of the paper. revision: partial

  2. Referee: [MinervaRL] MinervaRL self-training description: the claim that distilling verifier-labeled rollouts improves capability rests on the untested assumption that the task-specific verifiers are accurate, unbiased, and free of coverage gaps; no evidence on edge-case handling, partial-match scoring, or human-expert agreement is supplied, which is load-bearing for ruling out reward hacking.

    Authors: We acknowledge the importance of demonstrating verifier reliability to support the RLVR claims and rule out reward hacking. The verifiers are deterministic and directly implement community CTI schemas and standards as described in Section 2.2, with explicit scoring rules for structured outputs and identifiers. However, the manuscript does not include human-expert agreement metrics or dedicated edge-case analysis. We will add a new subsection (in the experiments) reporting a human validation study on a held-out sample of 200 outputs, measuring inter-annotator agreement with expert CTI analysts, and discussing coverage for partial matches and edge cases. This revision will directly address the concern. revision: yes
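
If the promised validation study is added, its headline statistic is straightforward to compute. Below is a minimal sketch of verifier-versus-expert agreement using Cohen's kappa on binary correct/incorrect verdicts; the label encoding and function name are illustrative, and the 200-output sample size comes from the rebuttal text above.

    def cohens_kappa(verifier_labels: list[int], expert_labels: list[int]) -> float:
        """Chance-corrected agreement between binary verifier verdicts (1 = correct)
        and expert judgments over the same held-out outputs."""
        assert len(verifier_labels) == len(expert_labels) and verifier_labels
        n = len(verifier_labels)
        observed = sum(v == e for v, e in zip(verifier_labels, expert_labels)) / n
        p_v = sum(verifier_labels) / n                 # verifier "correct" rate
        p_e = sum(expert_labels) / n                   # expert "correct" rate
        expected = p_v * p_e + (1 - p_v) * (1 - p_e)   # agreement expected by chance
        return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

Low kappa, even alongside high raw agreement, would signal that the verifiers systematically disagree with experts on exactly the cases the referee worries about.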

Circularity Check

0 steps flagged

No circularity: empirical RL gains rest on external verifiers and benchmarks

full rationale

The paper reports empirical performance lifts from training LLMs with task-specific deterministic verifiers on CTI schemas. No equations, derivations, or self-referential definitions are presented that reduce the reported scores to fitted parameters or prior outputs by construction. The MinervaRL self-training step generates new trajectories scored by the same external schema verifiers and distills them; this is a standard RL loop whose success is measured on held-out benchmarks rather than being tautological. Self-citations, if present, are not load-bearing for the central claim. The reported result therefore remains accountable to external evaluation rather than to its own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate explicit free parameters, axioms, or invented entities; the approach implicitly assumes standard RL convergence properties and that community CTI standards yield reliable deterministic verifiers.

pith-pipeline@v0.9.0 · 5497 in / 1049 out tokens · 43852 ms · 2026-05-16T09:20:57.359311+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1]

    Byers, R., Turner, C., and Brewer, T.

    URL https://www.usenix.org/system/files/usenixsecurity25-buechel.pdf. Byers, R., Turner, C., and Brewer, T. National Vulnerability Database. https://doi.org/10.18434/M3436.

  2. [2]

    Reasoning with Exploration: An Entropy Perspective

    National Institute of Standards and Technology (NIST). Accessed: 2026-01-19. Center for Threat-Informed Defense. Threat report ATT&CK mapper (TRAM). Project page, 2023. URL https://ctid.mitre.org/projects/threat-report-attck-mapper-tram/. Center for Threat-Informed Defense. Mappings Explorer. https://github.com/center-for-threat-informed-defense/mapp...

  3. [3]

    URL https://arxiv.org/abs/2408.09304

    doi: 10.48550/arXiv.2408.09304. URL https://arxiv.org/abs/2408.09304. Levi, M., Ohayon, D., Blobstein, A., Sagi, R., Molloy, I., and Allouche, Y. Toward cybersecurity-expert small language models. arXiv preprint arXiv:2510.14113.

  4. [4]

    URL https://arxiv.org/abs/2510.14113

    doi: 10.48550/arXiv.2510.14113. URL https://arxiv.org/abs/2510.14113. Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J. D., Dombrowski, A.-K., Goel, S., Phan, L., et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024. Liu, X., Tan, Y., Xiao, Z., Zhuge, J., and Zhou, R. Not...

  5. [5]

    Proximal Policy Optimization Algorithms

    URL https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=958028. Red Canary. Atomic Red Team. https://github.com/redcanaryco/atomic-red-team, 2026. Accessed: 2026-01-19. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. doi: 10.48550/arXiv.170...

  6. [6]

    doi: 10.48550/arXiv.2402.03300. SigmaHQ. Sigma Main Rule Repository. https://github.com/SigmaHQ/sigma, 2026. Accessed: 2026-01-19. Splunk. Splunk Security Content. https://github.com/splunk/security_content, 2026. Accessed: 2026-01-19. Strom, B. E., Applebaum, A., Miller, D. P., Nickels, K. C., Pennington, A. G., and Thomas, C. B. MITRE ATT&CK: Design...

  7. [7]

    Given a CVE description as input, the model predicts a single ATT&CK technique identifier (formatted as Txxxx or Txxxx.yyy)

    Mappings-Explorer CVE → ATT&CK Exploitation. This task maps vulnerability descriptions to the ATT&CK technique directly used for exploitation. Given a CVE description as input, the model predicts a single ATT&CK technique identifier (formatted as Txxxx or Txxxx.yyy). Ground-truth labels are obtained from the Center for Threat-Informed Defense Mappings Explo...

  8. [8]

    Mappings-Explorer CVE → ATT&CK Primary Impact. This task focuses on the main adversarial impact resulting from successful exploitation. The input is the CVE description, and the target is a single ATT&CK technique identifier corresponding to the primary post-exploitation effect (e.g., credential access or privilege escalation). Labels are sourced from the Ma...

  9. [9]

    Mappings-Explorer CVE → ATT&CK Secondary Impact. This task predicts a subsequent impact enabled by the primary impact. The input consists of the CVE description concatenated with the given primary-impact technique ID, and the target is a single ATT&CK technique identifier representing the secondary effect. Ground-truth annotations are derived from the Mappin...

  10. [10]

    Given a Sigma rule excerpt—including the rule title, logsource, and detection fields—the model predicts a multi-label set of ATT&CK tactic identifiers (formatted as TA000x)

    Sigma → ATT&CK Tactics. This task infers high-level adversary intent from detection logic. Given a Sigma rule excerpt (including the rule title, logsource, and detection fields), the model predicts a multi-label set of ATT&CK tactic identifiers (formatted as TA000x). Sigma rules are sourced from the public Sigma repository, and annotations are expressed using ...

  11. [11]

    Using the same Sigma rule excerpt as input, the model predicts a single ATT&CK technique identifier (formatted as Txxxx or Txxxx.yyy)

    Sigma → ATT&CK Technique. This task maps detection logic to the specific adversarial behavior it is designed to identify. Using the same Sigma rule excerpt as input, the model predicts a single ATT&CK technique identifier (formatted as Txxxx or Txxxx.yyy). Rules are drawn from the Sigma repository, with targets aligned to canonical ATT&CK technique identifie...

  12. [12]

    The input is an Atomic Red Team procedure snippet that includes execution steps or commands and platform context, and the target is a single ATT&CK technique identifier

    Atomic Red Team → ATT&CK Technique. This task maps adversary procedure descriptions to their corresponding ATT&CK techniques. The input is an Atomic Red Team procedure snippet that includes execution steps or commands and platform context, and the target is a single ATT&CK technique identifier. Examples are drawn from the Atomic Red Team repository, which is...

  13. [13]

    The input comprises the rule title, description, and associated KQL query, and the target is a single ATT&CK technique identifier

    Microsoft Sentinel → ATT&CK Technique. This task links analytics rules to the ATT&CK techniques they are intended to detect. The input comprises the rule title, description, and associated KQL query, and the target is a single ATT&CK technique identifier. Rules are sourced from the Microsoft Sentinel content repository, with labels normalized to the ATT&CK ...

  14. [14]

    The input consists of an SPL query, its detection narrative, and metadata, and the target is a single ATT&CK technique identifier

    Splunk Security Content → ATT&CK Technique. This task maps SPL-based detection content to the ATT&CK technique it targets. The input consists of an SPL query, its detection narrative, and metadata, and the target is a single ATT&CK technique identifier. Content is obtained from Splunk Security Content, with annotations expressed using canonical ATT&CK techn...

  15. [15]

    The input is a scenario text derived from ATT&CK procedure examples, and the target is a single ATT&CK technique identifier

    ATT&CK Scenario → Technique. This task identifies the ATT&CK technique that best corresponds to a described adversary scenario. The input is a scenario text derived from ATT&CK procedure examples, and the target is a single ATT&CK technique identifier.

  16. [16]

    Given the same scenario text as input, the model predicts a multi-label set of ATT&CK tactic identifiers (TA000x) associated with the underlying behavior

    ATT&CK Scenario → Tactics. This task infers the adversary intent categories implied by a scenario. Given the same scenario text as input, the model predicts a multi-label set of ATT&CK tactic identifiers (TA000x) associated with the underlying behavior. Annotations are derived from the ATT&CK taxonomy and its structured releases (MITRE Corporation, 2026d;a).

  17. [17]

    The input is the ATT&CK scenario text, and the target is a multi-label set of ATT&CK mitigation identifiers (Mxxxx) associated with the corresponding techniques

    ATT&CK Scenario → Mitigations. This task predicts mitigations relevant to the behaviors described in a scenario. The input is the ATT&CK scenario text, and the target is a multi-label set of ATT&CK mitigation identifiers (Mxxxx) associated with the corresponding techniques. Mitigation mappings are obtained from the ATT&CK knowledge base (MITRE Corporation, 2026d;a).

  18. [18]

    The input is the CVE description text, and the target is a multi-label set of CWE identifiers (formatted as CWE-xxx)

    NVD CVE → CWE. This task maps vulnerability descriptions to their underlying weakness categories. The input is the CVE description text, and the target is a multi-label set of CWE identifiers (formatted as CWE-xxx). CVE records are obtained from the National Vulnerability Database, with labels aligned to the CWE taxonomy (Byers et al., 2022; MITRE Corporatio...

  19. [19]

    Given a CVE description as input, the model outputs the corresponding CVSS v3.1 base vector string (e.g., CVSS:3.1/AV:N/...)

    NVD CVE → CVSS v3.1. This task predicts the CVSS v3.1 base vector associated with a vulnerability. Given a CVE description as input, the model outputs the corresponding CVSS v3.1 base vector string (e.g., CVSS:3.1/AV:N/...). CVE entries are sourced from the National Vulnerability Database, and targets follow the official CVSS v3.1 specification (Byers et a...

  20. [20]

    A threat actor

    ATT&CK Threat Actor Attribution. This task performs threat actor attribution from observed behaviors expressed as procedure text. Examples are derived from MITRE ATT&CK Enterprise intrusion-set (group) entries by collecting each actor's "techniques used" relationships and extracting the associated procedure descriptions that characterize how the actor operates...

  21. [21]

    The input is a CAPEC example description, and the target is a single CAPEC identifier (formatted as CAPEC-xxx)

    CAPEC Example → CAPEC. This task maps attack example narratives to the CAPEC attack pattern they exemplify. The input is a CAPEC example description, and the target is a single CAPEC identifier (formatted as CAPEC-xxx). Both example texts and pattern identifiers are sourced from the CAPEC catalog (MITRE Corporation, 2026b).

  22. [22]

    given the answer

    CAPEC Example → CWE. This task associates CAPEC attack examples with the underlying software weakness categories they exploit. The input is the same CAPEC example narrative, and the target is a multi-label set of CWE identifiers (formatted as CWE-xxx). Example descriptions are drawn from CAPEC, with labels aligned to the CWE taxonomy (MITRE Corporation, 202...

  23. [23]

    Leakage: says or implies the answer/label/options/reference were provided, or quotes/paraphrases provided reference text instead of reasoning from the prompt

  24. [24]

    Incoherent: loops, repeated phrases/lines, templated filler, or gibberish

  25. [25]

    Ungrounded: invents concrete details not in the question (extra CVEs, vendors, malware names, IOCs, dates, techniques, etc.)

  26. [26]

    Mismatch: reasoning supports a different label than the final answer, or directly contradicts it

  27. [27]

    Unsupported: provides reasoning but does not use evidence from the prompt to justify the answer (hand-wavy or irrelevant justification), OR provides an answer with no reasoning at all (answer-only)

  28. [28]

    Other: refusals, policy/meta artifacts, generic CTI tutorials, or prompt copying. Example outputs: {"label": "GOOD"} {"label": "BAD", "categoryid": 1, "category title": "Leakage"} QUESTION: {QUESTION} RESPONSE: {RESPONSE}

  29. [29]

    Input is a prompt with five options (A–E); output is a single letter A–E on the final line (optional brief justification allowed)

    CKT (Alam et al., 2025): Cyber Threat Intelligence multiple-choice QA. Input is a prompt with five options (A–E); output is a single letter A–E on the final line (optional brief justification allowed)

  30. [30]

    Input is a prompt with four options (A–D); output is a single letter A–D on the final line

    CyberMetric (Tihanyi et al., 2024): general cybersecurity multiple-choice QA. Input is a prompt with four options (A–D); output is a single letter A–D on the final line

  31. [31]

    Input includes report context and a question with options; output is a JSON object wrapped in <json object> tags containing a correct answers list

    SOCEval (Deason et al., 2025): multi-select reasoning over threat-intelligence reports. Input includes report context and a question with options; output is a JSON object wrapped in <json object> tags containing a correct answers list

  32. [32]

    Input is a CVE description; output is a single CWE-#### identifier on the final line

    RCM (Alam et al., 2025): root-cause mapping from CVE to CWE. Input is a CVE description; output is a single CWE-#### identifier on the final line

  33. [33]

    Input is a CVE description; output is a full CVSS v3.1 vector string (e.g., CVSS:3.1/AV:N/...)

    VSP (Alam et al., 2025): vulnerability scoring prediction. Input is a CVE description; output is a full CVSS v3.1 vector string (e.g., CVSS:3.1/AV:N/...)

  34. [34]

    Input describes attacker behavior (Windows environment); output is a single technique ID T#### or T####.### on the final line

    ATE (Alam et al., 2025): ATT&CK technique extraction. Input describes attacker behavior (Windows environment); output is a single technique ID T#### or T####.### on the final line.

  35. [35]

    Input describes an attack scenario; output is a set of mitigation IDs M10xx

    RMS (Alam et al., 2025): risk mitigation strategy. Input describes an attack scenario; output is a set of mitigation IDs M10xx

  36. [36]

    Input contains a detection rule (query + metadata); output is a single ATT&CK technique ID T####

    ElasticRule: Elastic rule to ATT&CK technique mapping. Input contains a detection rule (query + metadata); output is a single ATT&CK technique ID T####

  37. [37]

    Input is a sentence with entity type definitions; output is a JSON object mapping entity types to extracted entities

    APTNER (Wang et al., 2022): APT-focused named-entity recognition. Input is a sentence with entity type definitions; output is a JSON object mapping entity types to extracted entities

  38. [38]

    Input is a report segment with a list of candidate indicators; output labels each candidate as IoC vs

    LANCE (Froudakis et al., 2025): IoC identification (Prism meta task). Input is a report segment with a list of candidate indicators; output labels each candidate as IoC vs. non-IoC (aggregated over IP/URL/Domain/Hash subtasks)

  39. [39]

    This meta task aggregates four subtasks (entity extraction, entity typing, relation existence, relation label) with XML-like tag formats

    AnnoCTR (Lange et al., 2024): STIX-style entity and relation extraction. This meta task aggregates four subtasks (entity extraction, entity typing, relation existence, relation label) with XML-like tag formats

  40. [40]

    zero-reward

    AZERG (Lekssays et al., 2025): STIX-style entity and relation extraction. This meta task aggregates four subtasks (entity extraction, entity typing, relation existence, relation label) with XML-like tag formats. F. Additional Theory: Why MinervaRL Can Expand Empirical Support. Setup (answer-level view). Each training instance is (x, a⋆), where x ∈ X is the or...

  41. [41]

    The number of ACR attempts until the first accepted verified trace is geometric with mean ≤ 1/α

  42. [42]

    Consequently, once p_θ(a⋆ | x) ≥ ε_{k,ζ}, the probability that the base RLVR rollouts on x remain all-zero at budget k is at most ζ (Lemma F.1)

    After N = ⌈(log ε_{k,ζ} − log p_0) / Δ⌉ successful distillation updates on (x, y_acr), we have p_θ(a⋆ | x) ≥ ε_{k,ζ}. Consequently, once p_θ(a⋆ | x) ≥ ε_{k,ζ}, the probability that the base RLVR rollouts on x remain all-zero at budget k is at most ζ (Lemma F.1). Proof sketch. Item (1) is immediate from Assumption F.3. Item (2) follows by iterating Assumption F.4. The final state...
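
Read back into standard notation, the argument sketched in the last two entries appears to be the following. The identity ε_{k,ζ} = 1 − ζ^(1/k) is our reconstruction from the stated conclusion, and the acceptance rate α and per-update gain Δ are the quantities the cited assumptions appear to posit; none of this is quoted from the paper.

    % Reconstruction of the Lemma F.1-style sketch (assumptions: each ACR attempt is
    % accepted with probability at least alpha; each accepted distillation update on
    % (x, y_acr) raises log p_theta(a* | x) by at least Delta from an initial p_0).
    \mathbb{E}[\text{ACR attempts until first accepted trace}] \;\le\; \tfrac{1}{\alpha}
    \qquad\text{(geometric waiting time)}

    N \;=\; \Bigl\lceil \tfrac{\log \varepsilon_{k,\zeta} - \log p_0}{\Delta} \Bigr\rceil
    \;\;\Longrightarrow\;\; p_\theta(a^\star \mid x) \;\ge\; \varepsilon_{k,\zeta}

    \Pr\bigl[\text{all } k \text{ RLVR rollouts on } x \text{ score } 0\bigr]
    \;=\; \bigl(1 - p_\theta(a^\star \mid x)\bigr)^{k} \;\le\; \zeta
    \quad\text{once}\quad p_\theta(a^\star \mid x) \;\ge\; \varepsilon_{k,\zeta} := 1 - \zeta^{1/k}.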