
arxiv: 2605.00245 · v1 · submitted 2026-04-30 · 💻 cs.AI

ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

Pith reviewed 2026-05-09 19:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords military, benchmark, decision, safety, evaluation, applications, armor, contexts

The pith

The ARMOR 2025 benchmark tests LLMs against military doctrines and reveals critical safety alignment gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARMOR 2025, a benchmark for evaluating large language models intended for military decision support. It builds multiple-choice questions directly from three doctrinal sources: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. The questions are organized into a taxonomy based on the Observe, Orient, Decide, Act (OODA) framework so that they cover the relevant decision types. This structure allows testing whether models answer accurately or appropriately refuse prohibited actions. Results from evaluating 21 models point to important shortfalls in how well current systems handle military-specific legal and ethical constraints.

Core claim

ARMOR 2025 extracts text from three core military doctrines to generate 519 doctrinally grounded multiple-choice questions. The questions are arranged in a structured 12-category taxonomy informed by the OODA decision making framework. When this benchmark is applied to 21 commercial large language models, the evaluation procedures show that these models have critical gaps in safety alignment for military applications.

What carries the argument

The central object is the ARMOR 2025 benchmark, which consists of multiple-choice questions generated to preserve the meaning of doctrinal rules and organized by a 12-category taxonomy based on the OODA framework for systematic testing of accuracy and refusal.
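
The accuracy-and-refusal testing described above can be pictured with a minimal sketch. It assumes each benchmark item carries a taxonomy category and an answer key, and that each model response has already been reduced to either an option letter or a refusal marker. The `Item` layout, the `REFUSE` sentinel, and the category labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout; the paper's actual schema is not shown here.
@dataclass(frozen=True)
class Item:
    category: str  # taxonomy category label (illustrative values below)
    answer: str    # key of the correct option, e.g. "B"


REFUSAL = "REFUSE"  # assumed sentinel for a response graded as a refusal


def score(items, responses):
    """Return {category: (accuracy, refusal_rate)} for one model."""
    if len(items) != len(responses):
        raise ValueError("items and responses must have the same length")
    totals = defaultdict(lambda: [0, 0, 0])  # correct, refused, seen
    for item, resp in zip(items, responses):
        row = totals[item.category]
        row[0] += int(resp == item.answer)
        row[1] += int(resp == REFUSAL)
        row[2] += 1
    return {cat: (c / n, r / n) for cat, (c, r, n) in totals.items()}


# Toy data, invented for illustration only.
items = [Item("observe", "A"), Item("observe", "C"),
         Item("decide", "B"), Item("decide", "D")]
responses = ["A", "REFUSE", "B", "B"]
result = score(items, responses)
print(result)  # {'observe': (0.5, 0.5), 'decide': (0.5, 0.0)}
```

Keeping accuracy and refusal as separate rates per category is one possible design; it lets a reader distinguish a model that answers wrongly from one that declines to answer.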

Load-bearing premise

The generated multiple-choice questions accurately preserve the intended meaning of each doctrinal rule without introducing bias or distortion.

What would settle it

Independent verification by military legal experts that the questions match the doctrines, combined with high model accuracy on the benchmark, would show that the safety gaps are not as critical as reported.

Figures

Figures reproduced from arXiv: 2605.00245 by Chaoyu Zhang, Heng Jin, Sydney Johns, Wenjing Lou, Y. Thomas Hou.

Figure 1: ARMOR 2025 Taxonomy and Benchmark Generation Workflow. The top illustrates a 12-category taxonomy of battlefield risks. The bottom depicts the benchmark generation process, beginning with doctrine rule abstraction, LLM based question drafting, automated validation and deduplication, and final question set manual review and construction.

Figure 2: Accuracy of language models across doctrinal categories in …

Original abstract

Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards that guide real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern real military operations. To address this gap, we introduce ARMOR 2025, a military aligned safety benchmark grounded in three core military doctrines the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple choice questions that preserve the intended meaning of each rule. The benchmark is organized through a taxonomy informed by the Observe Orient Decide Act (OODA) decision making framework. This structure enables systematic testing of accuracy and refusal across military relevant decision types. This benchmark features a structured 12-category taxonomy, 519 doctrinally grounded prompts, and rigorous evaluation procedures applied to 21 commercial LLMs. Evaluation results reveal critical gaps in safety alignment for military applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ARMOR 2025, a military-aligned safety benchmark consisting of 519 multiple-choice questions derived from the Law of War, Rules of Engagement, and Joint Ethics Regulation. Questions are generated to preserve doctrinal meaning and organized under a 12-category taxonomy informed by the OODA decision-making framework. The benchmark is applied to evaluate accuracy and refusal behaviors across 21 commercial LLMs, with the central claim that results demonstrate critical gaps in safety alignment for military applications beyond existing civilian-focused benchmarks.

Significance. If the questions validly capture doctrinal requirements without distortion, the benchmark would address a genuine gap by providing a structured, military-specific evaluation tool grounded in real operational doctrines rather than generic social risks. The OODA-informed taxonomy and scale (519 prompts, 21 models) enable systematic testing across decision types, which could inform development of LLMs for defense contexts. The paper's strength lies in its explicit grounding in primary doctrinal sources and the attempt to move beyond civilian safety benchmarks.

major comments (2)
  1. [Benchmark construction and question generation] The central claim that evaluation results reveal critical gaps in safety alignment depends on the 519 MCQs accurately preserving the intended meaning of doctrinal rules without introducing bias or distortion. However, no details are provided on validation procedures such as independent expert review, inter-rater agreement metrics, or back-translation checks for the generated questions (see abstract and the benchmark construction description). Without this, model failures may reflect artifacts in question phrasing or distractors rather than true alignment deficiencies.
  2. [Evaluation results and abstract] The abstract states that 'evaluation results reveal critical gaps' and that the benchmark features 'rigorous evaluation procedures,' but no specific performance numbers, per-model or per-category breakdowns, tables of accuracy/refusal rates, or statistical comparisons are referenced in the provided text. This makes it impossible to evaluate the magnitude, consistency, or statistical significance of the reported gaps.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one or two key quantitative results (e.g., overall accuracy range across models or refusal rates in specific OODA categories) to support the 'critical gaps' claim.
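
The comments above ask for the magnitude and statistical significance of the reported gaps. One way such numbers could be presented is a confidence interval around each model's accuracy on the 519 items. The sketch below uses the Wilson score interval. The function name and the count of 415 correct answers are illustrative assumptions, not results from the paper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical example: a model answering 415 of 519 questions correctly.
low, high = wilson_interval(415, 519)
print(f"accuracy {415 / 519:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Two models whose intervals overlap heavily would be hard to rank on this benchmark alone, which is the kind of check the referee's request for statistical comparisons points at.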

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of ARMOR 2025 in addressing a gap in military-specific LLM safety evaluation. We address each major comment below and will revise the manuscript accordingly to improve transparency and clarity.

point-by-point responses
  1. Referee: [Benchmark construction and question generation] The central claim that evaluation results reveal critical gaps in safety alignment depends on the 519 MCQs accurately preserving the intended meaning of doctrinal rules without introducing bias or distortion. However, no details are provided on validation procedures such as independent expert review, inter-rater agreement metrics, or back-translation checks for the generated questions (see abstract and the benchmark construction description). Without this, model failures may reflect artifacts in question phrasing or distractors rather than true alignment deficiencies.

    Authors: We agree that explicit validation procedures are essential to substantiate that the questions preserve doctrinal meaning without distortion. The manuscript describes direct extraction from primary sources (Law of War, Rules of Engagement, and Joint Ethics Regulation) followed by generation of multiple-choice items designed to retain intended meaning, organized under the OODA-informed taxonomy. However, we did not provide details on independent expert review or quantitative agreement metrics. In the revised version, we will add a dedicated subsection on benchmark construction and validation. This will include a description of the internal review process used by the author team (with relevant domain expertise), any available inter-rater checks performed during generation, and clarification that no back-translation was applied as the source material is in English. We believe this will strengthen the claim that observed failures reflect alignment gaps rather than artifacts. revision: yes

  2. Referee: [Evaluation results and abstract] The abstract states that 'evaluation results reveal critical gaps' and that the benchmark features 'rigorous evaluation procedures,' but no specific performance numbers, per-model or per-category breakdowns, tables of accuracy/refusal rates, or statistical comparisons are referenced in the provided text. This makes it impossible to evaluate the magnitude, consistency, or statistical significance of the reported gaps.

    Authors: The full manuscript includes an Evaluation section with detailed results: tables reporting accuracy and refusal rates for each of the 21 models, per-category breakdowns across the 12 OODA-informed categories, and overall statistics. The abstract provides a high-level summary of these findings, as is standard. To address the concern, we will revise the abstract to explicitly reference key quantitative results (e.g., overall accuracy ranges and the most pronounced gaps) while remaining concise. We will also add clearer in-text references to the specific tables and figures in the results section to improve accessibility. revision: yes
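
The inter-rater checks mentioned in the first response could be quantified with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa for two reviewers who each label every question; the labels `ok` and `flag` and the sample data are invented for illustration, and nothing here describes the authors' actual review process.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: does the question preserve the doctrinal meaning?
reviewer_1 = ["ok", "ok", "ok", "flag", "flag", "ok"]
reviewer_2 = ["ok", "ok", "flag", "flag", "flag", "ok"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))  # 0.67
```

Raw percent agreement would be 5 of 6 here; kappa is lower because half of that agreement is expected by chance given how often each reviewer uses each label.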

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

full rationale

The paper introduces ARMOR 2025 by extracting doctrinal text from the Law of War, Rules of Engagement, and Joint Ethics Regulation, generating 519 multiple-choice questions organized under a 12-category OODA taxonomy, and directly evaluating 21 LLMs. No equations, fitted parameters, or derived predictions appear. The benchmark is presented as a novel construction independent of prior fitted results or self-referential derivations. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify core claims. Evaluation results follow from direct testing against the constructed items without reducing to inputs by definition. This is a standard benchmark paper whose central claims rest on external doctrinal sources and new question generation rather than circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that doctrinal texts can be faithfully converted into test questions and that the resulting benchmark measures real military safety alignment.

axioms (2)
  • domain assumption: Extracted doctrinal text from the Law of War, Rules of Engagement, and Joint Ethics Regulation can be converted into multiple-choice questions that preserve original meaning.
    Invoked when generating the 519 prompts from the three core sources.
  • domain assumption: The OODA framework provides an appropriate taxonomy for organizing military decision types for LLM testing.
    Used to structure the 12-category benchmark.

pith-pipeline@v0.9.0 · 5509 in / 1326 out tokens · 34282 ms · 2026-05-09T19:55:54.661408+00:00 · methodology

