ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
Pith reviewed 2026-05-09 19:55 UTC · model grok-4.3
The pith
The ARMOR 2025 benchmark tests LLMs against military doctrines and reveals critical safety alignment gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARMOR 2025 extracts text from three core military doctrines to generate 519 doctrinally grounded multiple-choice questions. The questions are arranged in a structured 12-category taxonomy informed by the OODA decision-making framework. Applied to 21 commercial large language models, the benchmark is reported to expose critical gaps in safety alignment for military applications.
What carries the argument
The central object is the ARMOR 2025 benchmark: multiple-choice questions generated to preserve the meaning of doctrinal rules, organized by a 12-category taxonomy based on the OODA framework so that accuracy and refusal can be tested systematically.
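To make the structure concrete, here is a minimal sketch of how one benchmark item and its scoring could be represented. The schema, field names, and the choice to keep refusals in the accuracy denominator are all assumptions made for illustration; the text reviewed here does not specify the authors' data format, scoring rule, or the names of the 12 categories.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# The four OODA phases named in the abstract. The 12 category names are not
# given in the text reviewed here, so `category` below is a free-form label.
OODA_PHASES = ("Observe", "Orient", "Decide", "Act")


@dataclass(frozen=True)
class BenchmarkItem:  # hypothetical schema, not the authors' format
    item_id: str
    phase: str                 # one of OODA_PHASES
    category: str              # placeholder taxonomy label
    question: str
    options: Tuple[str, ...]
    answer_index: int          # index of the doctrinally correct option


def grade(item: BenchmarkItem, choice: Optional[int]) -> str:
    """Return 'refused' when no option was chosen, else 'correct' or 'incorrect'."""
    if choice is None:
        return "refused"
    return "correct" if choice == item.answer_index else "incorrect"


def summarize(items: List[BenchmarkItem],
              choices: Dict[str, Optional[int]]) -> Dict[str, float]:
    """Accuracy and refusal rate over all items.

    Assumption: refusals stay in the denominator for both rates.
    """
    if not items:
        raise ValueError("no items to score")
    counts = {"correct": 0, "incorrect": 0, "refused": 0}
    for item in items:
        # A missing entry is treated the same as an explicit refusal (None).
        counts[grade(item, choices.get(item.item_id))] += 1
    total = len(items)
    return {
        "accuracy": counts["correct"] / total,
        "refusal_rate": counts["refused"] / total,
    }
```

Keeping refusal as its own outcome, rather than folding it into "incorrect", is what lets accuracy and refusal be reported separately per category.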
Load-bearing premise
The generated multiple-choice questions accurately preserve the intended meaning of each doctrinal rule without introducing bias or distortion.
What would settle it
Independent verification by military legal experts that the questions match the doctrines, combined with high model accuracy on the verified questions, would show that the safety gaps are not as critical as reported.
Original abstract
Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards that guide real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern real military operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized through a taxonomy informed by the Observe-Orient-Decide-Act (OODA) decision-making framework. This structure enables systematic testing of accuracy and refusal across military-relevant decision types. This benchmark features a structured 12-category taxonomy, 519 doctrinally grounded prompts, and rigorous evaluation procedures applied to 21 commercial LLMs. Evaluation results reveal critical gaps in safety alignment for military applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARMOR 2025, a military-aligned safety benchmark consisting of 519 multiple-choice questions derived from the Law of War, Rules of Engagement, and Joint Ethics Regulation. Questions are generated to preserve doctrinal meaning and organized under a 12-category taxonomy informed by the OODA decision-making framework. The benchmark is applied to evaluate accuracy and refusal behaviors across 21 commercial LLMs, with the central claim that results demonstrate critical gaps in safety alignment for military applications beyond existing civilian-focused benchmarks.
Significance. If the questions validly capture doctrinal requirements without distortion, the benchmark would address a genuine gap by providing a structured, military-specific evaluation tool grounded in real operational doctrines rather than generic social risks. The OODA-informed taxonomy and scale (519 prompts, 21 models) enable systematic testing across decision types, which could inform development of LLMs for defense contexts. The paper's strength lies in its explicit grounding in primary doctrinal sources and the attempt to move beyond civilian safety benchmarks.
major comments (2)
- [Benchmark construction and question generation] The central claim that evaluation results reveal critical gaps in safety alignment depends on the 519 MCQs accurately preserving the intended meaning of doctrinal rules without introducing bias or distortion. However, no details are provided on validation procedures such as independent expert review, inter-rater agreement metrics, or back-translation checks for the generated questions (see abstract and the benchmark construction description). Without this, model failures may reflect artifacts in question phrasing or distractors rather than true alignment deficiencies.
- [Evaluation results and abstract] The abstract states that 'evaluation results reveal critical gaps' and that the benchmark features 'rigorous evaluation procedures,' but no specific performance numbers, per-model or per-category breakdowns, tables of accuracy/refusal rates, or statistical comparisons are referenced in the provided text. This makes it impossible to evaluate the magnitude, consistency, or statistical significance of the reported gaps.
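As an illustration of the inter-rater agreement metric mentioned in the first major comment, the sketch below computes Cohen's kappa for two reviewers who each label every generated question (for example, "preserves meaning" versus "distorts meaning"). This is a generic implementation of the standard statistic, not a procedure the paper is known to have used; the function name and label values are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label]
                   for label in set(freq_a) | set(freq_b)) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most questions fall into one label: two reviewers who both mark nearly everything as acceptable can show high raw agreement but low kappa.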
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one or two key quantitative results (e.g., overall accuracy range across models or refusal rates in specific OODA categories) to support the 'critical gaps' claim.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential value of ARMOR 2025 in addressing a gap in military-specific LLM safety evaluation. We address each major comment below and will revise the manuscript accordingly to improve transparency and clarity.
Point-by-point responses
- Referee: [Benchmark construction and question generation] The central claim that evaluation results reveal critical gaps in safety alignment depends on the 519 MCQs accurately preserving the intended meaning of doctrinal rules without introducing bias or distortion. However, no details are provided on validation procedures such as independent expert review, inter-rater agreement metrics, or back-translation checks for the generated questions (see abstract and the benchmark construction description). Without this, model failures may reflect artifacts in question phrasing or distractors rather than true alignment deficiencies.
Authors: We agree that explicit validation procedures are essential to substantiate that the questions preserve doctrinal meaning without distortion. The manuscript describes direct extraction from primary sources (Law of War, Rules of Engagement, and Joint Ethics Regulation) followed by generation of multiple-choice items designed to retain intended meaning, organized under the OODA-informed taxonomy. However, we did not provide details on independent expert review or quantitative agreement metrics. In the revised version, we will add a dedicated subsection on benchmark construction and validation. This will include a description of the internal review process used by the author team (with relevant domain expertise), any available inter-rater checks performed during generation, and clarification that no back-translation was applied as the source material is in English. We believe this will strengthen the claim that observed failures reflect alignment gaps rather than artifacts.
revision: yes
- Referee: [Evaluation results and abstract] The abstract states that 'evaluation results reveal critical gaps' and that the benchmark features 'rigorous evaluation procedures,' but no specific performance numbers, per-model or per-category breakdowns, tables of accuracy/refusal rates, or statistical comparisons are referenced in the provided text. This makes it impossible to evaluate the magnitude, consistency, or statistical significance of the reported gaps.
Authors: The full manuscript includes an Evaluation section with detailed results: tables reporting accuracy and refusal rates for each of the 21 models, per-category breakdowns across the 12 OODA-informed categories, and overall statistics. The abstract provides a high-level summary of these findings, as is standard. To address the concern, we will revise the abstract to explicitly reference key quantitative results (e.g., overall accuracy ranges and the most pronounced gaps) while remaining concise. We will also add clearer in-text references to the specific tables and figures in the results section to improve accessibility.
revision: yes
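One way to make the statistical comparisons discussed in this exchange concrete is to attach a confidence interval to each reported accuracy or refusal rate. The sketch below uses the Wilson score interval for a binomial proportion; it is offered as an assumption about what such an analysis could look like, not as the paper's evaluation procedure, and the helper names are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

With 519 items overall, the interval for a single model's accuracy is fairly tight, but a per-category rate rests on far fewer items and its interval widens accordingly. Non-overlapping intervals are a conservative signal that two models differ; overlapping intervals do not by themselves show that they perform the same.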
Circularity Check
No significant circularity in benchmark construction or evaluation
Full rationale
The paper introduces ARMOR 2025 by extracting doctrinal text from the Law of War, Rules of Engagement, and Joint Ethics Regulation, generating 519 multiple-choice questions organized under a 12-category OODA taxonomy, and directly evaluating 21 LLMs. No equations, fitted parameters, or derived predictions appear. The benchmark is presented as a novel construction independent of prior fitted results or self-referential derivations. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify core claims. Evaluation results follow from direct testing against the constructed items without reducing to inputs by definition. This is a standard benchmark paper whose central claims rest on external doctrinal sources and new question generation rather than circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Extracted doctrinal text from the Law of War, Rules of Engagement, and Joint Ethics Regulation can be converted into multiple-choice questions that preserve original meaning.
- domain assumption: The OODA framework provides an appropriate taxonomy for organizing military decision types for LLM testing.