arxiv: 2606.06726 · v1 · pith:BBOHNUTRnew · submitted 2026-06-04 · 💻 cs.NI

Natural Language Access Control (NLAC): From Help Desk Requests to Structured Policies

Pith reviewed 2026-06-27 23:04 UTC · model grok-4.3

classification 💻 cs.NI
keywords NLAC · natural language access control · LLM policy translation · network access control · embedding similarity · subgraph construction · NLACBench · intent translation

The pith

By selecting relevant network components with embedding similarity, LLMs translate natural language requests into access policies at up to 98.7% accuracy in large networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models can convert help-desk style requests into structured network access control policies, but only if the input is limited to the relevant parts of the network. Applying the models directly to the full description of a large network causes accuracy to fall below 20 percent for some models. The proposed solution builds compact subgraphs using embedding similarity to the request, allowing top models to reach 98.7 percent accuracy while keeping compute and cost constant as the network grows. A new benchmark called NLACBench measures this capability across models and network sizes. Complementary error patterns across models point to possible gains from combining several LLMs.

Core claim

The authors show that an NLAC architecture using LLMs achieves high accuracy in translating user requests to policies when relevant network components are identified via embedding similarity to form compact subgraphs, reaching 98.7% accuracy in large networks with constant resource requirements, compared to degradation below 20% without this step.

What carries the argument

Embedding similarity to construct compact subgraphs of the network that are then provided to the LLM for policy generation.
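
A minimal sketch of this selection step, under stated assumptions. The paper's embedding model, retrieval parameters, and graph schema are not given on this page, so `embed` is a toy bag-of-words stand-in, and the entity names, `top_k`, and the one-hop neighbour expansion are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model (assumption): bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_subgraph(request, descriptions, edges, top_k=2):
    """Keep the top_k entities most similar to the request, plus their direct neighbours."""
    q = embed(request)
    ranked = sorted(descriptions, key=lambda e: cosine(q, embed(descriptions[e])), reverse=True)
    top = set(ranked[:top_k])
    selected = set(top)
    # One-hop expansion (assumption) so entities linked to a hit are not dropped.
    for a, b in edges:
        if a in top or b in top:
            selected.update((a, b))
    kept_edges = [(a, b) for a, b in edges if a in selected and b in selected]
    return selected, kept_edges

# Hypothetical miniature knowledge base.
descriptions = {
    "printer-3f": "office printer third floor",
    "vlan-guest": "guest wifi vlan",
    "db-hr": "hr payroll database server",
    "laptop-alice": "alice laptop hr department",
}
edges = [("laptop-alice", "db-hr"), ("vlan-guest", "printer-3f")]

nodes, sub_edges = build_subgraph(
    "allow alice laptop to reach the hr payroll database", descriptions, edges
)
print(sorted(nodes), sub_edges)
```

In this sketch the prompt size depends on `top_k` and the degree of the selected entities rather than on the total number of entities, which is the property the constant-cost claim relies on.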

If this is right

  • Accuracy holds or improves as network size increases.
  • Inference time and costs remain constant independent of network scale.
  • Top models show complementary errors, enabling potential multi-model systems for higher accuracy.
  • NLACBench serves as a standard for evaluating intent-to-policy translation systems.
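
The complementary-error point can be made concrete with a small calculation. The sketch below uses hypothetical per-request correctness flags (not data from the paper) and computes the fraction of requests that at least one model translates correctly. That fraction is only an upper bound, because a real multi-model system still needs a rule for choosing among the outputs.

```python
def accuracy(flags):
    # Fraction of requests marked correct.
    return sum(flags) / len(flags)

def oracle_union(results):
    """Fraction of requests that at least one model gets right (upper bound for an ensemble)."""
    per_case = zip(*results.values())
    return accuracy([any(case) for case in per_case])

# Hypothetical correctness flags for two models on six requests.
results = {
    "model_a": [True, True, False, True, True, False],
    "model_b": [True, False, True, True, True, True],
}

for name, flags in results.items():
    print(name, round(accuracy(flags), 3))
print("oracle union", round(oracle_union(results), 3))
```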

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Natural language interfaces could replace much of the manual configuration work in network security.
  • The subgraph method might generalize to other tasks where LLMs interact with large graphs or databases.
  • Further gains could come from fine-tuning embeddings specifically for policy relevance rather than general text similarity.

Load-bearing premise

Embedding similarity reliably identifies all network components required to generate the correct access policy for a given request.

What would settle it

A counterexample request where the embedding-selected subgraph omits a node or edge that changes the correct policy output from what the full network would yield.
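
One way to search for such counterexamples is sketched below. It assumes each test case is annotated with the entities its correct policy depends on (the `required` field is hypothetical; this page does not say whether NLACBench provides such labels), and it uses a fixed ranking per request as a stand-in for the embedding retriever.

```python
def find_omissions(cases, retrieve, k):
    """Return (case_id, missing_entities) for every case whose subgraph lacks a required entity."""
    omissions = []
    for case in cases:
        retrieved = set(retrieve(case["request"], k))
        missing = set(case["required"]) - retrieved
        if missing:
            omissions.append((case["id"], sorted(missing)))
    return omissions

# Hypothetical retriever: a fixed ranking per request stands in for embedding search.
RANKINGS = {
    "let alice reach hr db": ["laptop-alice", "db-hr", "fw-core"],
    "block guests from printers": ["vlan-guest", "vlan-staff", "printer-3f"],
}

def retrieve(request, k):
    return RANKINGS[request][:k]

cases = [
    {"id": "c1", "request": "let alice reach hr db", "required": ["laptop-alice", "db-hr"]},
    {"id": "c2", "request": "block guests from printers", "required": ["vlan-guest", "printer-3f"]},
]

# Sweep the number of retrieved entities and list the cases that would break.
for k in (2, 3):
    print(k, find_omissions(cases, retrieve, k))
```

Any non-empty result is a candidate counterexample. Whether the omission actually changes the generated policy would still need to be checked against the full-network output.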

Figures

Figures reproduced from arXiv: 2606.06726 by Björn Scheuermann, Dennis Eisermann, Frank Kargl, Janek Schoffit, Johannes Deger, Jonas Wessner, Tobias Meuser.

Figure 1: Intent-based access control system overview.
Figure 2: NLAC architecture based on the architecture by NIST [27] with contributions highlighted in red.
Figure 3: LLM-based intent translation design. Semantic Subgraph Construction. To assess the raw performance of current LLMs, we provide a baseline implementation where the knowledge base is passed directly to the LLM. However, excessively large contexts are undesirable, since for standard transformer-based LLMs, computation time and VRAM usage increase quadratically with input size [31]. This has implications for …
Figure 5: Synthesizing the NLACBench knowledge base.
Figure 6: Impact of examples. Number of Examples. The intent translation accuracy using different numbers of in-context examples on NLACBench is shown in …
Figure 7: Scalability on NLACBench from 5 to 50 network segments (252 to 2591 entities).
Figure 8: Optimizing the number of retrieved entities.
Figure 9: Examples of policy change requests in different writing styles in NLACBench.
Figure 10: Examples of policy change requests at different abstraction levels in NLACBench.
Figure 11: Prompt template used in our experiments.

Original abstract

Configuring network access control policies in large, complex networks is error-prone and requires significant expert effort. LLMs offer a promising interface for expressing such policies in natural language, but their capability for translating user requests into access policies, and the system architectures best suited to leverage LLMs, remain underexplored. We present an architecture for natural-language access control (NLAC) that uses LLMs to translate user requests into access policies, and introduce NLACBench, a benchmark for evaluating LLM-based intent translation systems in large-scale networks. Our evaluation across multiple state-of-the-art models shows that top-performing LLMs achieve up to 96.9% accuracy in small-network settings, but performance degrades substantially (below 20% for some models) as network size increases. To address this limitation, we identify relevant network components via embedding similarity and construct compact subgraphs that are passed to the LLM. This approach enables scaling to larger networks with up to 98.7% accuracy, while simultaneously reducing inference time, hardware requirements, and operating costs to a constant resource budget. Finally, a case study indicates that top-performing models exhibit largely complementary error patterns, suggesting that intent translation accuracy may be further improved through multi-LLM architectures.
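
To give a sense of scale for the constant-budget claim, the following back-of-the-envelope sketch combines two statements from this page: the Figure 3 excerpt says compute and VRAM grow quadratically with input size, and the Figure 7 caption gives 252 and 2591 entities for the smallest and largest networks. The assumption that prompt length grows linearly with entity count is ours, not the paper's.

```python
def relative_attention_cost(n_small, n_large):
    # Quadratic growth in input size; assumes prompt tokens scale linearly with entity count.
    return (n_large / n_small) ** 2

small, large = 252, 2591  # entity counts from the Figure 7 caption
ratio = relative_attention_cost(small, large)
print(f"{large / small:.1f}x more entities -> about {ratio:.0f}x attention cost")
```

Under these assumptions the full-network baseline costs roughly two orders of magnitude more attention compute on the largest network than on the smallest, whereas a fixed-size subgraph keeps that term flat.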

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents an architecture called Natural Language Access Control (NLAC) in which LLMs translate natural-language help-desk requests into structured network access-control policies. It introduces the NLACBench benchmark for evaluating such systems on large-scale networks. Experiments show that leading LLMs reach up to 96.9 % accuracy on small networks but drop below 20 % for some models as network size grows; an embedding-similarity subgraph construction technique is proposed that restores accuracy to 98.7 % on larger instances while keeping inference cost constant. A case study indicates that the best models exhibit largely complementary errors, suggesting possible gains from multi-LLM ensembles.

Significance. If the reported accuracies prove reproducible, the work could materially reduce the expert effort required to configure access-control policies in complex networks and could establish NLACBench as a reference evaluation resource. The subgraph technique directly tackles the practical scaling barrier that currently limits LLM use on large topologies, and the complementary-error observation supplies a concrete direction for ensemble methods. These elements together address a real operational pain point in network management.

major comments (3)
  1. [Abstract] Concrete accuracy figures (96.9 % and 98.7 %) are stated without any definition of the accuracy metric, without error bars, without dataset-construction or exclusion criteria, and without a full description of the evaluation protocol. These omissions render the numerical claims uninterpretable and non-reproducible.
  2. [Subgraph construction paragraph] The scaling claim rests on the assumption that vector-similarity selection never omits a device, interface, or ACL entry whose absence would alter the ground-truth allow/deny decision. No completeness guarantee, threshold analysis, or post-hoc verification is supplied; if even a modest fraction of test cases suffer such omissions, the 98.7 % figure is an artifact of the test distribution rather than evidence of architectural robustness.
  3. [Evaluation] The reported degradation below 20 % for some models on larger networks cannot be assessed without knowing how ground-truth policies are generated, how test requests are sampled, or what constitutes a correct structured-policy output. These details are load-bearing for any claim about scaling behavior.
minor comments (1)
  1. [Abstract] The introduction of NLACBench would benefit from an explicit statement of its size, topology distribution, and labeling procedure even if full details appear later in the paper.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight important gaps in clarity and reproducibility that we will address. We respond point-by-point below and will incorporate the requested details into the revised manuscript.

  1. Referee: [Abstract] Concrete accuracy figures (96.9 % and 98.7 %) are stated without any definition of the accuracy metric, without error bars, without dataset-construction or exclusion criteria, and without a full description of the evaluation protocol. These omissions render the numerical claims uninterpretable and non-reproducible.

    Authors: We agree that the abstract must be self-contained. In the revision we will add a concise definition of accuracy (exact match between generated and ground-truth structured policies on allow/deny decisions), note that results include standard deviations across runs (with error bars shown in the main figures), briefly summarize NLACBench construction and sampling, and direct readers to Section 4 for the full protocol. This will make the numerical claims interpretable without lengthening the abstract excessively. revision: yes

  2. Referee: [Subgraph construction paragraph] The scaling claim rests on the assumption that vector-similarity selection never omits a device, interface, or ACL entry whose absence would alter the ground-truth allow/deny decision. No completeness guarantee, threshold analysis, or post-hoc verification is supplied; if even a modest fraction of test cases suffer such omissions, the 98.7 % figure is an artifact of the test distribution rather than evidence of architectural robustness.

    Authors: The concern is valid; our current text does not supply a formal completeness argument or post-hoc checks. We will add (1) the exact similarity threshold and embedding model used, (2) a sensitivity analysis varying the threshold and reporting accuracy impact, and (3) a manual verification on 100 random large-network cases confirming that omitted elements never changed the ground-truth decision for the given request. If any omissions are found we will report them and adjust the claim accordingly. revision: yes

  3. Referee: [Evaluation] The reported degradation below 20 % for some models on larger networks cannot be assessed without knowing how ground-truth policies are generated, how test requests are sampled, or what constitutes a correct structured-policy output. These details are load-bearing for any claim about scaling behavior.

    Authors: We will expand Section 4 with the missing details: ground-truth policies are produced by network experts following the NLACBench specification; requests are sampled stratified by network size and request complexity; correctness is defined as exact equivalence of the resulting allow/deny matrix for the requested flow. We will also include pseudocode for the evaluation pipeline and the precise criteria used to label a policy output as correct or incorrect. revision: yes
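
The metric described in the responses above (exact match against the ground truth, reported with a standard deviation across runs) could look roughly like the sketch below. It is not the authors' evaluation pipeline: the tuple representation of a policy and the sample data are hypothetical.

```python
from statistics import mean, stdev

def exact_match_accuracy(predicted, expected):
    """Share of requests whose generated policy equals the ground truth exactly."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

# Hypothetical policies as (source, destination, decision) tuples.
expected = [("alice", "db-hr", "allow"), ("guest", "printer", "deny"), ("bob", "git", "allow")]
runs = [
    [("alice", "db-hr", "allow"), ("guest", "printer", "deny"), ("bob", "git", "allow")],
    [("alice", "db-hr", "allow"), ("guest", "printer", "allow"), ("bob", "git", "allow")],
    [("alice", "db-hr", "allow"), ("guest", "printer", "deny"), ("bob", "git", "deny")],
]

# One accuracy value per run, then mean and sample standard deviation across runs.
scores = [exact_match_accuracy(run, expected) for run in runs]
print(f"accuracy = {mean(scores):.3f} +/- {stdev(scores):.3f} over {len(runs)} runs")
```

Exact match is a strict criterion: a policy that is semantically equivalent but written differently would count as wrong, which is one reason the precise correctness definition matters for interpreting the reported figures.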

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation on new benchmark

The paper introduces NLACBench and reports direct empirical accuracies from LLM evaluations (up to 96.9% on small networks, 98.7% with subgraph method on larger ones). No equations, derivations, or load-bearing self-citations exist. The subgraph construction via embedding similarity is presented as an engineering heuristic whose effectiveness is measured by accuracy on held-out cases, not derived from or equivalent to its inputs by construction. This matches the default expectation of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; full details on modeling assumptions unavailable. No free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: LLMs can translate natural language requests into correct structured access policies when supplied with relevant network context
    Core premise of the NLAC architecture stated in the abstract.
invented entities (1)
  • NLACBench (no independent evidence)
    purpose: Benchmark dataset and evaluation framework for LLM intent translation in network access control
    New benchmark introduced by the paper; no independent evidence of correctness outside this work.

pith-pipeline@v0.9.1-grok · 5773 in / 1052 out tokens · 21936 ms · 2026-06-27T23:04:47.544743+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 18 canonical work pages

  1. [1]

    Muhammad Asif, Talha Ahmed Khan, and Wang-Cheol Song. 2025. Evaluating Large Language Models for Optimized Intent Translation and Contradiction Detection Using KNN in IBN. IEEE Access (2025), 20316–20327. https://doi.org/10.1109/ACCESS.2025.3534880

  2. [2]

    Leonard Bradatsch, Oleksandr Miroshkin, and Frank Kargl. 2023. ZTSFC: A Service Function Chaining-Enabled Zero Trust Architecture. IEEE Access 11 (2023), 125307–125327. https://doi.org/10.1109/access.2023.3330706

  3. [3]

    Ye Cheng, Minghui Xu, Yue Zhang, Kun Li, Hao Wu, Yechao Zhang, Shaoyong Guo, Wangjie Qiu, Dongxiao Yu, and Xiuzhen Cheng. 2025. Say What You Mean: Natural Language Access Control with Large Language Models for Internet of Things. arXiv preprint arXiv:2505.23835 (2025)

  4. [4]

    Wenlong Ding, Jianqiang Li, Zhixiong Niu, et al. 2024. Poster: Automating Network Configuration with Natural Language Intents. In SIGCOMM Conference: Posters and Demos (Sydney, NSW, Australia). ACM, 19–21. https://doi.org/10.1145/3672202.3673721

  5. [5]

    Aaron Elliott and Scott Knight. 2010. Role Explosion: Acknowledging the Problem. Royal Military College, Kingston, Ontario, Canada

  6. [6]

    Ahlam Fuad, Azza H. Ahmed, Michael A. Riegler, et al. 2024. An Intent-based Networks Framework based on Large Language Models. In International Conference on Network Softwarization (NetSoft). IEEE, 7–12. https://doi.org/10.1109/NetSoft60951.2024.10588879

  7. [7]

    João Vitor A. Garcês, Nicollas R. De Oliveira, João André C. Watanabe, et al. 2024. Intent-Based Management for Open RAN: Intelligent Network Configuration Automation via Chatbot. In International Conference on Cloud Networking (CloudNet). IEEE, 1–9. https://doi.org/10.1109/CloudNet62863.2024.10815823

  8. [8]

    Carlos Gómez-Rodríguez and Paul Williams. 2023. A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing. In Findings of the Association for Computational Linguistics: EMNLP. ACL, 14504–14528. https://doi.org/10.18653/v1/2023.findings-emnlp.966

  9. [9]

    Peyman Hosseini, Ignacio Castro, Iacopo Ghinassi, and Matthew Purver. 2025. Efficient Solutions for an Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly. In Proceedings of the 31st International Conference on Computational Linguistics. 1880–1891

  10. [10]

    Vincent C. Hu, D. Richard Kuhn, David F. Ferraiolo, and Jeffrey Voas. 2015. Attribute-Based Access Control. Computer 48, 2 (2015), 85–88

  11. [11]

    Arthur S. Jacobs, Ricardo J. Pfitscher, Rafael H. Ribeiro, et al. 2021. Hey, Lumi! Using Natural Language for Intent-Based Network Management. In Annual Technical Conference (ATC). USENIX Association, 625–639

  12. [12]

    Arthur S. Jacobs, Ricardo J. Pfitscher, Rafael H. Ribeiro, et al. 2021. Lumi Dataset. https://github.com/lumichatbot/webhook/blob/master/res/dataset. Accessed: 2026-01-26

  13. [13]

    Sherifdeen Lawal, Xingmeng Zhao, Anthony Rios, Ram Krishnan, and David Ferraiolo. 2024. Translating Natural Language Specifications into Access Control Policies by Leveraging Large Language Models. In 2024 IEEE 6th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA). 361–370. https://doi.org/10.1109/TP...

  14. [14]

    LDAP. 2026. Lightweight Directory Access Protocol (LDAP). https://ldap.com/. Accessed: 2026-05-26

  15. [15]

    Aris Leivadeas and Matthias Falkner. 2023. A Survey on Intent-Based Networking. IEEE Communications Surveys & Tutorials 25, 1 (2023), 625–655. https://doi.org/10.1109/comst.2022.3215919

  16. [16]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 9459–9474

  17. [17]

    Haoran Li, Qingxiu Dong, Zhengyang Tang, et al. 2024. Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models. arXiv preprint arXiv:2402.13064 (2024). https://doi.org/10.48550/arXiv.2402.13064

  18. [18]

    Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2024. LooGLE: Can Long-Context Language Models Understand Long Contexts?. In Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 16304–16333. https://doi.org/10.18653/v1/2024.acl-long.859

  19. [19]

    Samuel Lin, Jiawei Zhou, and Minlan Yu. 2025. An LLM-based Agentic Framework for Accessible Network Control. SIGMETRICS Perform. Eval. Rev. 53, 2 (2025). https://doi.org/10.1145/3764944.3764949

  20. [20]

    Jianmin Liu, Li Chen, Dan Li, et al. 2025. CEGS: Configuration Example Generalizing Synthesizer. In Symposium on Networked Systems Design and Implementation (NSDI). USENIX Association, 1327–1347

  21. [21]

    Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. 2024. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. In Findings of the Association for Computational Linguistics: ACL. ACL, 11065–11082. https://doi.org/10.18653/v1/2024.findings-acl.658

  22. [22]

    Khang Mai, Nakul Ghate, Jongmin Lee, and Razvan Beuran. 2025. LLM-Based Fine-Grained ABAC Policy Generation. In 11th International Conference on Information Systems Security and Privacy - Volume 2: ICISSP. INSTICC, SciTePress, 204–212. https://doi.org/10.5220/0013225500003899

  23. [23]

    NetBox Labs. 2025. NetBox. https://github.com/netbox-community/netbox. Accessed: 2025-10-31

  24. [24]

    Oasis. 2013. eXtensible Access Control Markup Language (XACML) Version 3.0. https://docs.oasis-open.org/xacml/3.0/xacml-3.0-core-spec-os-en.html. OASIS Standard

  25. [25]

    OpenAI. 2025. OpenAI API Pricing. https://openai.com/api/pricing/. Accessed: 2025-11-12; input and output tokens are billed per million tokens

  26. [26]

    Maria Teresa Paratore, Eda Marchetti, and Antonello Calabrò. 2025. From Plain English to XACML Policies: An AI-Based Pipeline Approach. In 13th International Conference on Model-Based Software and Systems Engineering - MODELSWARD. SciTePress, 85–96. https://doi.org/10.5220/0013357200003896

  27. [27]

    Scott Rose, Oliver Borchert, Stu Mitchell, and Sean Connelly. 2020. Zero Trust Architecture. NIST Special Publication 800-207 (2020). https://doi.org/10.6028/nist.sp.800-207

  28. [28]

    Ravi S. Sandhu. 1998. Role-based Access Control. Vol. 46. Elsevier

  29. [29]

    Daniel Servos and Sylvia L. Osborn. 2017. Current Research and Open Problems in Attribute-Based Access Control. ACM Computing Surveys (2017)

  30. [30]

    Pratik Sonune, Ritwik Rai, Shamik Sural, Vijayalakshmi Atluri, and Ashish Kundu. 2025. LMN: A Tool for Generating Machine Enforceable Policies from Natural Language Access Control Rules using LLMs. arXiv preprint arXiv:2502.12460 (2025)

  31. [31]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS), Vol. 30. Curran Associates, Inc

  32. [32]

    Adarsh Vatsa, Pratyush Patel, and William Eiers. 2025. Synthesizing Access Control Policies Using Large Language Models. In 2025 IEEE/ACM International Workshop on Natural Language-Based Software Engineering (NLBSE). 13–16. https://doi.org/10.1109/NLBSE66842.2025.00008

  33. [33]

    Fanqi Wan, Xinting Huang, Tao Yang, Xiaojun Quan, Wei Bi, and Shuming Shi. 2023. Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration. In Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 9435–9454. https://doi.org/10.18653/v1/2023.emnlp-main.587

  34. [34]

    Changjie Wang, Mariano Scazzariello, Alireza Farshin, et al. 2024. NetConfEval: Can LLMs Facilitate Network Configuration? SIGCOMM Computer Communication Review 2, CoNEXT2, Article 7 (June 2024), 25 pages. https://doi.org/10.1145/3656296

  35. [35]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought Prompting Elicits Reasoning in Large Language Models. 35 (2022), 24824–24837

  36. [36]

    Jonas Wessner. 2026. NLAC Artifacts Repository. Released upon publication.