Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs
Pith reviewed 2026-05-16 09:20 UTC · model grok-4.3 · Recognition: 1 Lean theorem link
The pith
Reinforcement learning guided by deterministic verifiers raises LLM accuracy on cyber threat intelligence tasks by 15.8 points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MinervaRL is a lightweight self-training mechanism that uses task-specific verifiers to generate additional verified trajectories and distills them back into the model; averaged across four backbones and twelve CTI benchmarks, this yields a mean improvement of 15.8 percentage points over the corresponding base models and 4.3 points over GRPO.
What carries the argument
MinervaRL, the lightweight self-training mechanism that generates and distills verified trajectories to address reward sparsity during reinforcement learning with verifiable rewards.
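The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `sample_fn`, the toy "policy", and the exact-match verifier are all hypothetical stand-ins. The idea it shows is the one the paper names: when all k base rollouts for a prompt earn zero verifier reward (reward sparsity), keep sampling until a verified trajectory appears, and collect that trajectory for supervised distillation back into the model.

```python
import random

def self_train_verified(sample_fn, verifier, prompts, k=4, max_attempts=50, seed=0):
    """Hypothetical MinervaRL-style loop: for prompts whose k base rollouts
    all receive zero verifier reward, search for a verified trajectory and
    buffer it for distillation."""
    rng = random.Random(seed)
    buffer = []
    for x in prompts:
        rollouts = [sample_fn(x, rng) for _ in range(k)]
        if any(verifier(x, y) for y in rollouts):
            continue  # a reward signal already exists; plain RLVR can use it
        for _ in range(max_attempts):
            y = sample_fn(x, rng)
            if verifier(x, y):
                buffer.append((x, y))  # verified trajectory to distill back
                break
    return buffer

# Toy setup (assumed, not from the paper): the "policy" almost always emits
# the canonical ATT&CK ID for prompt "a" but rarely does for prompt "b".
gold = {"a": "T1059", "b": "T1566"}

def sample_fn(x, rng):
    return gold[x] if rng.random() < (0.9 if x == "a" else 0.05) else "T0000"

verifier = lambda x, y: y == gold[x]

buffer = self_train_verified(sample_fn, verifier, ["a", "b"])
```

Under this toy setup the buffer only ever receives trajectories the verifier accepts, which is the invariant that makes distilling them back a supervised update on verified data.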
If this is right
- Structured CTI outputs become more reliable under RLVR than under supervised fine-tuning alone.
- Canonical identifiers and schemas in CTI resources enable scalable, deterministic reward signals without large new annotation efforts.
- Performance gains hold across multiple LLM backbones when the same verifier pipeline is applied.
- Self-training on verified trajectories reduces the impact of reward sparsity in rollout phases.
Where Pith is reading between the lines
- The same verifier-driven RL approach could transfer to other domains that maintain canonical schemas, such as medical coding or legal document structuring.
- As CTI standards evolve, the method may lower the cost of retraining models by relying on updated verifiers rather than full supervised datasets.
- Extending verifier coverage to rare edge cases in threat artifacts would likely produce additional measurable gains beyond the reported averages.
Load-bearing premise
The task-specific verifiers must produce accurate, unbiased, and sufficiently dense rewards without systematic errors or coverage gaps that could mislead the reinforcement learning process.
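To make this premise concrete, here is a minimal sketch of what a deterministic, schema-based verifier could look like for one task family (ATT&CK technique prediction). The function name and all-or-nothing scoring are assumptions for illustration; the paper only states that verifiers score structured outputs and identifier predictions against canonical schemas.

```python
import re

# Canonical ATT&CK technique schema: Txxxx or Txxxx.yyy (sub-technique).
TECHNIQUE_RE = re.compile(r"T\d{4}(\.\d{3})?")

def technique_reward(prediction: str, gold: str) -> float:
    """Hypothetical deterministic reward: 1.0 only for a schema-valid,
    exactly matching identifier; 0.0 otherwise (no partial credit)."""
    pred = prediction.strip()
    if not TECHNIQUE_RE.fullmatch(pred):
        return 0.0  # schema violation: output is not a canonical identifier
    return 1.0 if pred == gold else 0.0

technique_reward("T1059.001", "T1059.001")  # 1.0
technique_reward("T1059", "T1059.001")      # 0.0: parent technique != sub-technique
technique_reward("run cmd.exe", "T1059")    # 0.0: free text fails the schema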
What would settle it
Run the trained models on a fresh set of CTI artifacts and compare outputs against independent expert annotations or alternative verification tools; failure to match the reported gains or consistent verifier-expert disagreement would falsify the improvement claim.
Original abstract
Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose MinervaRL, a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Averaged across four backbones and 12 CTI benchmarks, MinervaRL improves the mean score by 15.8 percentage points over the corresponding base models and by 4.3 points over GRPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Minerva, a unified dataset and training pipeline for cyber threat intelligence (CTI) subtasks that pairs each task with deterministic, schema-based verifiers to enable reinforcement learning with verifiable rewards (RLVR). It proposes MinervaRL, a lightweight self-training mechanism that generates additional verified trajectories to mitigate reward sparsity and distills them back into the policy. Averaged across four backbones and 12 CTI benchmarks, the method is reported to improve mean scores by 15.8 percentage points over base models and 4.3 points over GRPO.
Significance. If the verifiers prove reliable and the gains hold under scrutiny, the work would provide concrete evidence that RLVR can improve structured output generation in a high-stakes domain by exploiting community-maintained schemas, offering a scalable path beyond supervised fine-tuning for tasks with verifiable outputs.
major comments (2)
- [Abstract] The headline gains (15.8 pp over base models, 4.3 pp over GRPO) are presented without experimental details, ablation studies, verifier definitions, statistical controls, or tables, making the central empirical claim impossible to evaluate from the supplied text.
- [MinervaRL] The claim that distilling verifier-labeled rollouts improves capability rests on the untested assumption that the task-specific verifiers are accurate, unbiased, and free of coverage gaps; no evidence on edge-case handling, partial-match scoring, or human-expert agreement is supplied, and that evidence is load-bearing for ruling out reward hacking.
minor comments (1)
- The abstract refers to 'four backbones and 12 CTI benchmarks' without naming them, which reduces reproducibility and clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of clarity and empirical rigor that we will address in the revision. We provide point-by-point responses below.
Point-by-point responses
Referee: [Abstract] The headline gains (15.8 pp over base models, 4.3 pp over GRPO) are presented without experimental details, ablation studies, verifier definitions, statistical controls, or tables, making the central empirical claim impossible to evaluate from the supplied text.
Authors: We agree that the abstract, constrained by length, presents results at a high level without the supporting details. The full manuscript details the experimental setup (four backbones, 12 benchmarks), verifier definitions, and ablations in Sections 2 and 4, along with the main results table. We will revise the abstract to include a concise reference to the experimental scope (e.g., 'across four backbones and 12 CTI benchmarks') and the core methodological elements, while staying within word limits. This will improve standalone evaluability without duplicating the body of the paper. revision: partial
Referee: [MinervaRL] The claim that distilling verifier-labeled rollouts improves capability rests on the untested assumption that the task-specific verifiers are accurate, unbiased, and free of coverage gaps; no evidence on edge-case handling, partial-match scoring, or human-expert agreement is supplied, and that evidence is load-bearing for ruling out reward hacking.
Authors: We acknowledge the importance of demonstrating verifier reliability to support the RLVR claims and rule out reward hacking. The verifiers are deterministic and directly implement community CTI schemas and standards as described in Section 2.2, with explicit scoring rules for structured outputs and identifiers. However, the manuscript does not include human-expert agreement metrics or dedicated edge-case analysis. We will add a new subsection (in the experiments) reporting a human validation study on a held-out sample of 200 outputs, measuring inter-annotator agreement with expert CTI analysts, and discussing coverage for partial matches and edge cases. This revision will directly address the concern. revision: yes
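The validation study the authors propose amounts to measuring chance-corrected agreement between verifier labels and expert labels on a held-out sample. A minimal sketch of that computation, using plain Cohen's kappa on binary accept/reject labels (the 200-output sample size and binary framing are assumptions from the rebuttal, not reported results):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences, e.g.
    verifier decisions vs. expert CTI-analyst decisions on the same outputs."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement under independent labeling with the same marginals.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

# Toy sample: verifier accepts 3 of 4 outputs, expert accepts 2 of 4.
kappa = cohen_kappa([1, 1, 1, 0], [1, 1, 0, 0])  # 0.5 on this toy data
```

Systematic verifier-expert disagreement on such a sample (low kappa) would be exactly the coverage-gap evidence the referee asks for.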
Circularity Check
No circularity: empirical RL gains rest on external verifiers and benchmarks
full rationale
The paper reports empirical performance gains from training LLMs with task-specific deterministic verifiers built on CTI schemas. No equations, derivations, or self-referential definitions reduce the reported scores to fitted parameters or prior outputs by construction. The MinervaRL self-training step generates new trajectories scored by the same external schema verifiers and distills them back; this is a standard RL loop whose success is measured on held-out benchmarks rather than tautologically. Self-citations, if present, are not load-bearing for the central claim. The reported gains therefore rest on external evaluation rather than circular construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. ... task-specific verifiers that score structured outputs and identifier predictions."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.