arXiv: 2605.07472 · v1 · submitted 2026-05-08 · 💻 cs.CR · cs.AI · cs.MA

Recognition: 2 theorem links

HBEE: Human Behavioral Entropy Engine -- Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion

Vickson Ferrel

Pith reviewed 2026-05-11 01:56 UTC · model grok-4.3

classification: 💻 cs.CR · cs.AI · cs.MA

keywords: insider threat detection, multi-agent simulation, LLM agents, peer suspicion graph, adaptive OPSEC, detection inversion, UEBA, behavioral simulation

The pith

In an LLM-driven multi-agent simulation, an adaptive insider ends up with lower peer suspicion than innocent agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the assumption that adaptive insiders leave detectable behavioral traces by running a pre-registered multi-agent simulation where LLM agents act as colleagues and one acts as an insider. It crosses defender modes with naive versus adaptive adversary behaviors across many runs and measures two signals: peer suspicion in-degree and UEBA rank. The main result is that the adaptive insider receives lower suspicion than a random innocent agent at the 60-minute mark, with a large effect size in the opposite direction from the pre-registered prediction. The two signals also decouple, and the simulation's communication pattern fails a pre-registered match to real email data. This matters because it shows how adaptive evasion can invert a common detection approach even in a controlled setting.

Core claim

The pre-registered study isolates defender mode and adversary type and finds a detection inversion at T_60: the adaptive mole's suspicion in-degree is lower than that of a randomly selected innocent agent (Cliff's delta = -0.694). Equivalence tests show no shift in the mole's UEBA rank under either defender mode. The peer-suspicion graph and UEBA signals decouple under adaptive OPSEC. A pre-registered Gini calibration check fails because the simulator's message-exposure distribution diverges from the SNAP Enron reference.
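
As a reading aid, the effect size quoted above can be sketched in a few lines. Cliff's delta is the probability that a value from the first group exceeds one from the second, minus the probability of the reverse, taken over all cross-group pairs. The in-degree values below are hypothetical and chosen only for illustration; they are not the paper's telemetry, and the function name is an assumption of this sketch.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical suspicion in-degree values (illustration only, not the paper's data).
mole_in_degree = [0, 1, 0, 2, 1]
innocent_in_degree = [2, 3, 1, 4, 2]

delta = cliffs_delta(mole_in_degree, innocent_in_degree)
print(f"Cliff's delta = {delta:.2f}")
```

A negative value means the first group (here the mole) tends to sit below the second, which is the direction of the reported inversion. For these made-up values the result is -0.76.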

What carries the argument

The HBEE multi-agent simulator in which LLM agents exchange messages, report suspicions, and generate in-degree scores on a peer graph while one agent follows adaptive OPSEC directives.
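
A minimal sketch of how an in-degree score on a peer-suspicion graph could be computed, assuming each report is a directed (reporter, suspect) pair. The deduplication and self-report handling here are assumptions of this sketch; the page does not say how HBEE weights or filters edges, and the agent names are hypothetical.

```python
from collections import Counter


def suspicion_in_degree(reports, agents):
    """Count distinct accusers per agent from (reporter, suspect) pairs."""
    # Assumption: repeated reports by the same reporter count once,
    # and self-reports are ignored.
    edges = {(reporter, suspect) for reporter, suspect in reports if reporter != suspect}
    counts = Counter(suspect for _, suspect in edges)
    return {agent: counts.get(agent, 0) for agent in agents}


# Hypothetical agents and reports, for illustration only.
agents = ["alice", "bob", "carol", "mole"]
reports = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("alice", "bob"),   # duplicate report, counted once
    ("bob", "carol"),
]

in_degree = suspicion_in_degree(reports, agents)
print(in_degree)
```

In this toy graph the mole is never reported and so has in-degree 0, below bob's 2, which is the shape of outcome the paper describes as a detection inversion.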

If this is right

Peer-suspicion cascade detection can invert against an adaptive adversary.
UEBA rank and peer suspicion become independent signals when the adversary adapts.
Adaptive OPSEC produces no detectable change in the insider's UEBA rank.
Generalization beyond the simulator is bounded because the message-exposure distribution does not match real organizational email data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detection systems may need separate mechanisms for AI-assisted adaptive evasion rather than relying on either signal alone.
Simulator-based testing can surface counterintuitive failure modes before deployment in live environments.
Improving the match between simulated communication patterns and empirical data would strengthen claims about real-world applicability.

Load-bearing premise

LLM directives can produce human-like adaptive operational security behavior and the resulting message graph creates suspicion dynamics comparable to real organizations.

What would settle it

Replace the LLM agents with human participants instructed to follow the same adaptive OPSEC rules and re-measure whether the adaptive insider still receives lower suspicion in-degree than innocents.

Figures

Figures reproduced from arXiv: 2605.07472 by Vickson Ferrel.

**Figure 2.** Empirical CDF of per-run mole rank in the C2 [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

**Figure 3.** H3 detection inversion (paper’s primary empirical [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** H2 equivalence forest plot. Each entry shows the mean paired rank difference (treatment [PITH_FULL_IMAGE:figures/full_fig_p009_4.png]

**Figure 5.** Behavioral fidelity rating (1–5 Likert, author-rated) per sampled mole turn, plotted against tick of occurrence ( [PITH_FULL_IMAGE:figures/full_fig_p010_5.png]

**Figure 6.** Lorenz curves of communication concentration in [PITH_FULL_IMAGE:figures/full_fig_p011_6.png]

Original abstract

Insider threat detection assumes that an adaptive insider leaves behavioral residue distinguishing them from legitimate users. We test this assumption against an LLM-driven adaptive insider in a controlled multi-agent simulator. Our pre-registered five-condition study isolates defender mode (cascade vs. blind UEBA) crossed with adversary type (naive vs. adaptive OPSEC) plus a no-mole control, across 100 runs (95 valid after pre-committed exclusions). The primary finding is a detection inversion: at T_60, the adaptive mole's suspicion in-degree is statistically lower than a randomly selected innocent agent (Cliff's delta = -0.694, 95% BCa CI [-0.855, -0.519], Mann-Whitney p << 0.01). The pre-registered prediction was the opposite direction. A pre-registered equivalence test (H2) shows adaptive OPSEC produces no detectable shift in the mole's UEBA rank under either defender mode. The two detection signals (peer suspicion graph in-degree and per-agent UEBA rank) decouple under adaptive adversary behavior. We bound generalization explicitly: a pre-registered Gini calibration check (H4) returns FAIL, with HBEE pairwise message-exposure Gini (0.213) diverging from the SNAP Enron reference (0.730) by |Delta Gini| = 0.52, exceeding the equivalence bound by 5x. The paper makes a narrow but surprising claim: in a controlled environment where adaptive OPSEC is implementable as an LLM directive, peer-suspicion-cascade detection inverts. We release the simulator, pre-registration document, frozen scenarios, raw telemetry, and analysis pipeline under an open-source license.
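
The abstract reports a 95% BCa bootstrap interval for Cliff's delta. The sketch below uses the simpler percentile bootstrap instead, which is a different (swapped-in) method and will not reproduce BCa intervals exactly; it only illustrates the resampling idea. The sample values, function names, and the choice of 2000 resamples are assumptions of this sketch.

```python
import random


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def percentile_bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Cliff's delta (not BCa)."""
    rng = random.Random(seed)  # seeded for repeatability
    stats = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement.
        boot_xs = rng.choices(xs, k=len(xs))
        boot_ys = rng.choices(ys, k=len(ys))
        stats.append(cliffs_delta(boot_xs, boot_ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical in-degree samples (illustration only).
mole = [0, 1, 0, 2, 1, 0, 1]
innocent = [2, 3, 1, 4, 2, 3, 2]
ci_low, ci_high = percentile_bootstrap_ci(mole, innocent)
print(f"95% percentile CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that lies entirely below zero, as the reported [-0.855, -0.519] does, indicates that the direction of the effect is stable under resampling.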

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pre-registered LLM simulation reports peer-suspicion inversion under adaptive OPSEC, but the model fails its own calibration against real organizational data.

The core takeaway is that this paper runs a pre-registered five-condition simulation and finds the opposite of its directional prediction: at T_60 the adaptive mole shows lower suspicion in-degree than a random innocent (Cliff's delta -0.694). It also reports decoupling between the peer-suspicion graph and UEBA ranks under adaptive behavior.

That result is new in the cited literature and comes with explicit bounds plus open artifacts, which is more than most simulation papers deliver. The pre-registration, equivalence tests, confidence intervals, and release of the simulator, scenarios, and pipeline are genuine strengths; they let a reader check the numbers directly. The statistical reporting on the inversion itself looks clean on the reported metrics.

The soft spots are exactly where the stress-test note flags them. The pre-registered Gini calibration (H4) fails badly: HBEE message-exposure Gini is 0.213 against the SNAP Enron reference of 0.730, missing the equivalence bound by five times. That single check shows the simulated communication graph does not reproduce real organizational exposure patterns, so the suspicion dynamics are likely an artifact of the graph-construction rules or the LLM prompting rather than a general behavioral finding. There is also no independent validation that the LLM directives produce human-comparable adaptive OPSEC. The five-run exclusion is minor and pre-committed, but it does not fix the calibration gap.

This paper is for researchers who build or evaluate insider-threat simulations and UEBA systems; anyone testing detection assumptions with multi-agent models will find the pre-reg and artifact release useful even if they treat the inversion as a cautionary case rather than a general result. It deserves a serious referee to examine the simulation mechanics and whether the bounded claims are stated tightly enough. I would send it to peer review.
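
The H4 arithmetic mentioned in the letter can be checked with a short sketch. The `gini` function uses the standard rank-weighted formula for non-negative values; the two Gini figures are the ones reported in the abstract. The bound of 0.10 is an assumption inferred only from the phrase "exceeding the equivalence bound by 5x" and is not stated on this page.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, i = 1..n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


HBEE_GINI = 0.213   # reported in the abstract
ENRON_GINI = 0.730  # reported in the abstract
delta_gini = abs(HBEE_GINI - ENRON_GINI)

# Hypothetical bound, inferred from "exceeding the equivalence bound by 5x".
ASSUMED_BOUND = 0.10
h4_passes = delta_gini <= ASSUMED_BOUND

print(f"|Delta Gini| = {delta_gini:.3f}, passes = {h4_passes}")
```

The difference works out to 0.517, which rounds to the reported 0.52. A lower Gini for HBEE means message exposure in the simulator is spread far more evenly across agent pairs than in the Enron email network.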

Referee Report

2 major / 1 minor

Summary. The paper reports a pre-registered five-condition multi-agent LLM simulation (HBEE) of insider-threat detection, crossing defender mode (cascade vs. blind UEBA) with adversary type (naive vs. adaptive OPSEC) plus control, across 100 runs (95 valid). The primary result is a detection inversion at T_60: the adaptive mole's suspicion in-degree is lower than a random innocent agent's (Cliff's delta = -0.694, 95% BCa CI [-0.855, -0.519], Mann-Whitney p << 0.01), contrary to the pre-registered directional prediction. Equivalence testing shows no UEBA-rank shift under adaptive OPSEC; the two signals decouple. A pre-registered Gini calibration (H4) against SNAP Enron fails (|Delta Gini| = 0.52, 5x over bound), and the authors explicitly bound generalization while releasing the simulator, pre-registration, data, and pipeline.
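
For readers unfamiliar with equivalence testing, the logic behind the H2 result can be sketched as a confidence-interval inclusion check: equivalence is declared when the interval for the mean paired difference lies wholly inside a pre-set margin. This sketch substitutes a percentile bootstrap interval for the t-based two one-sided tests procedure, and the rank differences and the margin of 2.0 are hypothetical; none of this is taken from the paper's pipeline.

```python
import random
from statistics import mean


def equivalence_by_ci(diffs, bound, n_boot=2000, alpha=0.05, seed=0):
    """Return (equivalent, (lo, hi)) using a (1 - 2*alpha) bootstrap CI of the mean."""
    rng = random.Random(seed)  # seeded for repeatability
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = means[int(alpha * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha) * n_boot))]
    # Equivalent only if the whole interval sits strictly inside (-bound, +bound).
    return (-bound < lo and hi < bound), (lo, hi)


# Hypothetical paired UEBA rank differences (adaptive minus naive), illustration only.
rank_diffs = [0, 1, -1, 0, 0, 1, -1, 0, 0, 0]
equivalent, (ci_lo, ci_hi) = equivalence_by_ci(rank_diffs, bound=2.0)
print(equivalent, ci_lo, ci_hi)
```

The design point is that a non-significant difference alone does not show "no shift"; an equivalence test has to show that the plausible range of the shift is smaller than a margin fixed in advance.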

Significance. If the simulation faithfully reproduces human-like adaptive OPSEC and suspicion dynamics, the inversion and decoupling would challenge the core assumption that adaptive insiders necessarily produce detectable behavioral residue in peer-suspicion graphs, with potential implications for UEBA and cascade detectors. Strengths include the pre-registration, explicit equivalence testing, confidence intervals, and open release of code and materials. However, the result's bearing on real insider-threat detection is limited by the unvalidated LLM fidelity and the large divergence from the Enron reference graph.

major comments (2)

[H4 calibration check] |Delta Gini| = 0.52 (HBEE 0.213 vs. Enron 0.730) exceeds the pre-registered equivalence bound by 5x. Because the headline suspicion in-degree outcome is generated from the pairwise message-exposure graph, this divergence means the observed inversion may not reflect dynamics comparable to real organizations.
[Methods on LLM prompt templates for adaptive OPSEC] The central statistical claim (lower suspicion in-degree for the adaptive mole) rests on the untested assumption that the LLM directives produce human-comparable OPSEC behavior. No human-subject validation, ablation on prompt variants, or external behavioral benchmark is reported to rule out simulation artifact.

minor comments (1)

[Abstract] The abstract states the pre-registered prediction was opposite the observed inversion but does not quote the exact directional hypothesis; adding the verbatim pre-registered wording would improve transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on our pre-registered simulation study. We address each major comment below. The manuscript already explicitly reports the H4 calibration failure and bounds its claims accordingly; we make no broader assertions about real-world organizations.

Referee: [H4 calibration check] |Delta Gini| = 0.52 (HBEE 0.213 vs. Enron 0.730) exceeds the pre-registered equivalence bound by 5x. Because the headline suspicion in-degree outcome is generated from the pairwise message-exposure graph, this divergence means the observed inversion may not reflect dynamics comparable to real organizations.

Authors: We agree that the |Delta Gini| = 0.52 divergence shows the HBEE message-exposure graph does not match the Enron reference structure. This is precisely why H4 was pre-registered as a calibration check and why the manuscript states that the check fails, explicitly bounding generalization to the controlled LLM simulation rather than claiming equivalence to real organizations. The core result is the detection inversion observed inside this reproducible environment, supported by open release of the simulator, prompts, data, and pipeline. No changes to the manuscript are needed because the limitation is already disclosed in the abstract, results, and discussion.

revision: no

Referee: [Methods on LLM prompt templates for adaptive OPSEC] The central statistical claim (lower suspicion in-degree for the adaptive mole) rests on the untested assumption that the LLM directives produce human-comparable OPSEC behavior. No human-subject validation, ablation on prompt variants, or external behavioral benchmark is reported to rule out simulation artifact.

Authors: We acknowledge that the adaptive OPSEC behavior is implemented via LLM directives without human-subject validation, prompt ablations, or external behavioral benchmarks. The study is framed as a controlled, pre-registered simulation to isolate the effect of adaptive OPSEC directives on detection signals, with all materials released to support independent scrutiny or extension. We have added a clarifying sentence in the limitations section noting the absence of human validation and the consequent need for caution in extrapolating to human insiders. This addresses the comment while preserving the narrow scope of the pre-registered claims.

revision: partial

Circularity Check

0 steps flagged

No significant circularity: simulation outcomes and pre-registered checks are independent of internal fits or self-citations

The paper reports results from a pre-registered multi-agent LLM simulation study with explicit statistical comparisons (Cliff's delta, Mann-Whitney) and a failed external calibration check (H4 Gini against SNAP Enron). No equations, parameters, or derivations reduce by construction to the target claims; the inversion finding is an empirical outcome from the runs, not a fitted input renamed as prediction. No self-citations are load-bearing for the central claim, no ansatz is smuggled, and no uniqueness theorems or renamings of known results appear. The explicit bounding via calibration failure further separates the narrow simulation claim from any internal circular loop. The derivation chain is self-contained against the simulation telemetry.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that LLM agents can implement adaptive OPSEC via prompt directives and that in-degree in the suspicion graph measures real peer suspicion. No free parameters are explicitly fitted to the target result; the 100-run design and pre-committed exclusions are stated as fixed. The HBEE framework is introduced as the simulation engine but lacks independent falsifiable evidence beyond the current runs.

free parameters (1)

LLM prompt templates for adaptive OPSEC
The adaptive behavior is implemented as an LLM directive whose exact wording and temperature settings are not detailed in the abstract.

axioms (2)

domain assumption: LLM agents can simulate human behavioral entropy and adaptive OPSEC
Invoked to justify the adversary type contrast between naive and adaptive moles.
domain assumption: Suspicion in-degree and UEBA rank are valid, independent detection signals
Used to interpret the decoupling result and the primary statistical test.

invented entities (1)

HBEE (Human Behavioral Entropy Engine) · no independent evidence
purpose: Framework for generating and measuring behavioral patterns in the multi-agent simulation
Introduced as the core simulation engine; no external validation or falsifiable prediction outside the current study is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

primary finding is a detection inversion: at T_60, the adaptive mole's suspicion in-degree is statistically lower than a randomly selected innocent agent (Cliff's delta = -0.694)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
