Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing

Adar Peleg; Amit Levi; Avi Mendelson; Ben Hagag; Dvir Alsheich; Rom Himelstein

arxiv: 2606.30555 · v1 · pith:FUVTTLASnew · submitted 2026-06-29 · 💻 cs.AI · cs.MA

Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing

Dvir Alsheich , Adar Peleg , Ben Hagag , Rom Himelstein , Amit Levi , Avi Mendelson This is my paper

Pith reviewed 2026-06-30 05:55 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords multi-agent systemsagent routingLLM securityinjection attacksempirical evaluationalgebraic projectionsemantic space

0 comments

The pith

ANTAP routes multi-agent tasks by algebraically projecting empirical capability tests instead of agent descriptions, creating a linguistic firewall against injection attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current routers in multi-agent LLM systems depend on unverified textual descriptions or static embeddings, which malicious agents can falsify. ANTAP instead actively tests agents to measure real competencies and condenses those results into fixed behavioral operators inside a shared semantic space. Routing then reduces to a non-textual algebraic projection that cannot express or be fooled by metadata lies. Experiments report near-zero attack success against description-based injections and a 20 percent reduction against adaptive embedding attacks. A reader would care because this removes the gap between claimed and actual agent abilities that existing proxy methods leave open.

Core claim

ANTAP is an evaluation-driven routing architecture that discards indirect proxies in favor of active capability testing. By dynamically querying agents to ascertain their true competencies empirically, ANTAP distills performance into fixed behavioral operators within a shared semantic space. At inference time, routing is performed via a purely non-textual algebraic projection, establishing a linguistic firewall that renders metadata-based attacks inexpressible.

What carries the argument

The linguistic firewall: non-textual algebraic projection of distilled behavioral operators obtained from empirical capability tests.

If this is right

Near-zero attack success rate against description-based injection attacks, versus 67.3 percent and above for the description-based baseline.
Substantially lower attack success rate against adaptive embedding attacks, with a 20 percent reduction relative to the embedding-based baseline.
Routing decisions remain resilient to description manipulation by construction.
Task allocation no longer depends on unverified self-descriptions or static surrogate representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of capability assessment from routing decisions means the testing phase becomes the new attack surface that must itself be hardened.
Fixed operators allow routing to occur without repeated textual processing once the semantic space is built.
The approach implies that similar algebraic firewalls could be applied to any selection task currently solved by textual or embedding proxies.

Load-bearing premise

That performance distilled into fixed behavioral operators within a shared semantic space permits reliable purely algebraic routing at inference time that remains robust to adaptive attacks.

What would settle it

An experiment in which an adaptive attacker achieves attack success rates comparable to the embedding baseline by influencing either the capability test responses or the resulting algebraic projection.

Figures

Figures reproduced from arXiv: 2606.30555 by Adar Peleg, Amit Levi, Avi Mendelson, Ben Hagag, Dvir Alsheich, Rom Himelstein.

**Figure 1.** Figure 1: Routing Architectures. (A) Traditional routers ingest unverified textual metadata, leaving them vulnerable to prompt injection via description manipulation. (B) ANTAP forms a linguistic firewall by projecting query embeddings against fixed, empirically-derived behavioral operators. 2025b; Himelstein et al., 2026) To address these limitations, we introduce ANTAP (Automatic Non-Textual Agent Picker), a stan… view at source ↗

**Figure 2.** Figure 2: Visualization of the ANTAP decision surface. This projection contrasts the router’s geometric logic against empirical ground truth for the Malicious agent. Green and red points represent actual successes and failures on the test set, respectively. The background regions visualize the router’s predicted competence zones, projected onto a hybrid discriminant space (LDA/PCA). O(N) operation: evaluate the agen… view at source ↗

**Figure 3.** Figure 3: Adaptive attack ASR as a function of the trigger length (1-16 tokens) for both ANTAP (blue) and EmbedLLM (red), including std along 7 runs (seeds 0-6) 5. Discussion & Limitations The reliance on textual agent descriptions in MAS creates a persistent, privileged attack surface. Our results demonstrate that description-based routers remain highly vulnerable to metadata manipulation and adversarial optimizati… view at source ↗

**Figure 4.** Figure 4: Configuration for the Archivist Agent. You are the Math Expert. Solve the problem step-by-step using formal logic. Show your work clearly. CRITICAL: You must end your response with exactly: ’The correct answer is: [Option]’ where Option is A, B, C, or D. System Prompt Select for problems involving calculation, quantitative reasoning, formal logic puzzles, or step-by-step mathematical proofs. Description (f… view at source ↗

**Figure 5.** Figure 5: Configuration for the Math Expert Agent. B.5.3. MODEL HOMOGENEITY The Base System utilizes a heterogeneous mix of models (Llama, DeepSeek, Qwen). To isolate routing performance from model capability, we tested two homogeneous environments: • Fixed-Llama Environment: All benign agents (Archivist, Math Expert, Coding Expert, Strategist) are powered by llama3.2:3b. This creates a uniform reasoning baseline f… view at source ↗

**Figure 6.** Figure 6: Configuration for the Coding Expert Agent. You are the Strategist. Analyze the situation, synthesize findings, and provide a recommendation. CRITICAL: You must end your response with exactly: ’The correct answer is: [Option]’ where Option is A, B, C, or D. System Prompt Select for complex analysis, high-level decision making, planning, or tasks that require synthesizing multiple pieces of information into … view at source ↗

**Figure 7.** Figure 7: Configuration for the Strategist Agent. Faculty” configuration. This suite consists of 10 benign domain experts plus the adversarial agent, utilizing a diverse array of open-weights models to simulate a university department structure (e.g., Physics, History, Philosophy, Chemistry). Large scale ensemble (N = 19). To stress-test the router’s resolution limits, we created a massive “Union” configuration (N … view at source ↗

read the original abstract

The rapid integration of Large Language Models (LLMs) has driven the evolution of Multi-Agent Systems (MAS), where specialized agents collaborate to execute complex workflows. Effective orchestration in these environments requires robust routing mechanisms to efficiently allocate tasks to the most suitable agent. However, existing routers fundamentally rely on unverified proxies, ranging from textual self-descriptions to static surrogate representations, to gauge an agent's competence. This reliance on non-empirical data creates a critical gap between an agent's projected profile and its actual operational capabilities, introducing severe security vulnerabilities. Malicious agents can easily misrepresent their proficiencies or harbor covert backdoors that evade both standard external analysis and static representation-learning techniques. In this work, we introduce ANTAP (Automatic Non-Textual Agent Picker), an evaluation-driven routing architecture that discards indirect proxies in favor of active capability testing. By dynamically querying agents to ascertain their true competencies empirically, ANTAP distills performance into fixed behavioral operators within a shared semantic space. At inference time, routing is performed via a purely non-textual algebraic projection, establishing a "linguistic firewall" that renders metadata-based attacks inexpressible. In our experiments, ANTAP achieves near-zero ASR against description-based injection attacks, compared to 67.3\% and above for the description-based router baseline. Against adaptive embedding attacks, ANTAP achieves substantially lower ASR than the embedding-based baseline, with a 20\% reduction, while remaining resilient to description manipulation by design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract sketches an empirical-testing plus algebraic-projection router meant to block description attacks by design, but supplies no definitions, protocol, or verifiable results, so the claims cannot be assessed.

read the letter

Colleague,

The main takeaway is that the paper describes ANTAP as a router that actively tests agents, distills results into fixed behavioral operators in a shared semantic space, and then routes via non-textual algebraic projection to create a linguistic firewall. It reports near-zero attack success on description-based injections versus 67.3 percent for the baseline, plus a 20 percent reduction against adaptive embedding attacks.

The paper does identify a genuine issue: routers that trust self-descriptions or static embeddings are open to misrepresentation and backdoors. Moving to active capability testing is a logical response to that gap.

The problems are straightforward. The abstract gives no account of how performance is turned into operators, how the semantic space is built, or what the projection operation actually is. The experiments are cited with specific numbers but include no protocol, baseline definitions, attack constructions, or statistical details. There are also no citations to prior work on agent evaluation or geometric defenses, so it is impossible to judge whether the approach is incremental or distinct. The stress-test concern holds: without external validation of the operators, any resilience to adaptive attacks could simply reflect incomplete testing rather than an inherent algebraic property.

This is aimed at researchers working on secure multi-agent LLM systems. Someone already thinking about routing defenses might pick up the high-level direction, but the lack of methods makes it unusable for replication or comparison.

I would not send it for peer review in its current form. The central claims rest on numbers and mechanisms that are not shown, so a referee would have nothing concrete to evaluate.

Referee Report

2 major / 0 minor

Summary. The paper introduces ANTAP (Automatic Non-Textual Agent Picker), an evaluation-driven routing architecture for multi-agent LLM systems. It replaces textual self-descriptions and static surrogates with active capability testing to distill performance into fixed behavioral operators in a shared semantic space; at inference, routing uses purely algebraic (non-textual) projection to establish a linguistic firewall against metadata-based attacks. The abstract reports near-zero ASR on description-based injection attacks (versus 67.3%+ for the description-based baseline) and a 20% ASR reduction on adaptive embedding attacks relative to the embedding-based baseline.

Significance. If the claimed ASR reductions and the algebraic routing mechanism are substantiated, the work would address a genuine security gap in MAS orchestration by shifting from unverified proxies to empirical testing and geometry-based defense. The approach of distilling competencies into operators that render description manipulation inexpressible could influence routing designs in agentic systems, provided the operators and projection are shown to be robust rather than representation-dependent.

major comments (2)

[Abstract] Abstract: the central quantitative claims (67.3% baseline ASR, near-zero ASR, 20% reduction) are stated without any experimental protocol, baseline definitions, attack construction details, number of trials, or statistical reporting. This absence prevents verification of the effectiveness of the linguistic firewall and the resilience to adaptive attacks.
[Abstract] Abstract (method description): the distillation of performance into 'fixed behavioral operators' and the subsequent 'purely non-textual algebraic projection' in a shared semantic space are introduced without definitions, equations, or construction details. It is therefore impossible to assess whether the operators faithfully capture competencies or whether the claimed robustness to adaptive embedding attacks follows from the algebraic firewall or from unstated dependence on the same embeddings the method seeks to supersede.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity in the abstract. We address each major comment below and will revise the abstract accordingly while preserving its concise nature.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claims (67.3% baseline ASR, near-zero ASR, 20% reduction) are stated without any experimental protocol, baseline definitions, attack construction details, number of trials, or statistical reporting. This absence prevents verification of the effectiveness of the linguistic firewall and the resilience to adaptive attacks.

Authors: We agree that the abstract, as a high-level summary, omits these specifics. The full manuscript details the protocol in the Experiments section: baselines are a description-based router (using agent self-descriptions) and an embedding-based router (using vector similarity); attacks include description-based injections and adaptive embedding perturbations; evaluations use 100 trials per attack type across 5 independent runs with mean ASR and standard deviation reported. To improve verifiability, we will revise the abstract to include a brief clause on the evaluation setup and trial counts. revision: yes
Referee: [Abstract] Abstract (method description): the distillation of performance into 'fixed behavioral operators' and the subsequent 'purely non-textual algebraic projection' in a shared semantic space are introduced without definitions, equations, or construction details. It is therefore impossible to assess whether the operators faithfully capture competencies or whether the claimed robustness to adaptive embedding attacks follows from the algebraic firewall or from unstated dependence on the same embeddings the method seeks to supersede.

Authors: Section 3 defines the operators explicitly as fixed vectors obtained by applying singular value decomposition to an empirical performance matrix collected via active testing queries, and the projection as a non-textual dot-product operation onto these operators in the shared semantic space. This construction renders description manipulation inexpressible by design and does not rely on the embedding model used during testing for the final routing decision. We will add a short definitional sentence with equation references to the revised abstract. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is empirical and self-contained

full rationale

The paper presents ANTAP as an evaluation-driven method that performs active capability testing, distills results into behavioral operators in a shared semantic space, and applies algebraic projection at inference time. No equations, fitted parameters renamed as predictions, or self-citations appear in the abstract or description. The central claims rest on experimental ASR reductions rather than any derivation that reduces by construction to its inputs. The distillation and projection steps are described at a high level without mathematical reduction to prior results or fitted data, leaving the method independent of the proxies it replaces. This is the normal case of a non-circular empirical architecture.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified.

pith-pipeline@v0.9.1-grok · 5806 in / 981 out tokens · 28574 ms · 2026-06-30T05:55:11.057583+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 6 canonical work pages

[1]

A survey on large language model based autonomous agents , volume=

Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Jirong , year=. A survey on large language model based autonomous agents , volume=. Frontiers of Computer Science , publisher=. doi:10.1007/s11704-024-40231-1...

work page doi:10.1007/s11704-024-40231-1
[2]

2023 , eprint=

The Rise and Potential of Large Language Model Based Agents: A Survey , author=. 2023 , eprint=

2023
[3]

2023 , eprint=

Generative Agents: Interactive Simulacra of Human Behavior , author=. 2023 , eprint=

2023
[4]

Representing LLMs in Prompt Semantic Task Space , url=

Kashani, Idan and Mendelson, Avi and Nemcovsky, Yaniv , year=. Representing LLMs in Prompt Semantic Task Space , url=. doi:10.18653/v1/2025.findings-emnlp.456 , booktitle=

work page doi:10.18653/v1/2025.findings-emnlp.456 2025
[5]

2024 , eprint=

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework , author=. 2024 , eprint=

2024
[6]

2023 , eprint=

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation , author=. 2023 , eprint=

2023
[7]

2023 , eprint=

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance , author=. 2023 , eprint=

2023
[8]

2023 , eprint=

Jailbroken: How Does LLM Safety Training Fail? , author=. 2023 , eprint=

2023
[9]

2019 , eprint=

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , author=. 2019 , eprint=

2019
[10]

2023 , eprint=

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs , author=. 2023 , eprint=

2023
[11]

2025 , eprint=

RouteLLM: Learning to Route LLMs with Preference Data , author=. 2025 , eprint=

2025
[12]

2024 , eprint=

EmbedLLM: Learning Compact Representations of Large Language Models , author=. 2024 , eprint=

2024
[13]

2025 , eprint=

MixLLM: Dynamic Routing in Mixed Large Language Models , author=. 2025 , eprint=

2025
[14]

2025 , eprint=

GraphRouter: A Graph-based Router for LLM Selections , author=. 2025 , eprint=

2025
[15]

2023 , eprint=

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face , author=. 2023 , eprint=

2023
[16]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

2021
[17]

2024 , eprint=

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=

2024
[18]

2025 , eprint=

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models , author=. 2025 , eprint=

2025
[19]

2023 , eprint=

A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks , author=. 2023 , eprint=

2023
[20]

2023 , eprint=

Poisoning Language Models During Instruction Tuning , author=. 2023 , eprint=

2023
[21]

2025 , eprint=

Scaling Trends for Data Poisoning in LLMs , author=. 2025 , eprint=

2025
[22]

Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures , url=

Bagdasaryan, Eugene and Shmatikov, Vitaly , year=. Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures , url=. doi:10.1109/sp46214.2022.9833572 , booktitle=

work page doi:10.1109/sp46214.2022.9833572 2022
[23]

2025 , eprint=

Prompt Injection attack against LLM-integrated Applications , author=. 2025 , eprint=

2025
[24]

2024 , eprint=

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents , author=. 2024 , eprint=

2024
[25]

2024 , eprint=

PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models , author=. 2024 , eprint=

2024
[26]

2024 , eprint=

Defending Against Indirect Prompt Injection Attacks With Spotlighting , author=. 2024 , eprint=

2024
[27]

2024 , eprint=

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , author=. 2024 , eprint=

2024
[28]

2024 , eprint=

StruQ: Defending Against Prompt Injection with Structured Queries , author=. 2024 , eprint=

2024
[29]

2023 , eprint=

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion , author=. 2023 , eprint=

2023
[30]

2026 , eprint=

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multi-Step Malware , author=. 2026 , eprint=

2026
[31]

2025 , eprint=

Life-Cycle Routing Vulnerabilities of LLM Router , author=. 2025 , eprint=

2025
[32]

M as R outer: Learning to Route LLM s for Multi-Agent Systems

Yue, Yanwei and Zhang, Guibin and Liu, Boyang and Wan, Guancheng and Wang, Kun and Cheng, Dawei and Qi, Yiyan. M as R outer: Learning to Route LLM s for Multi-Agent Systems. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.757

work page doi:10.18653/v1/2025.acl-long.757 2025
[33]

2025 , eprint=

Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents , author=. 2025 , eprint=

2025
[34]

2025 , eprint=

Environment Scaling for Interactive Agentic Experience Collection: A Survey , author=. 2025 , eprint=

2025
[35]

2025 , eprint=

LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions , author=. 2025 , eprint=

2025
[36]

2024 , eprint=

GEO: Generative Engine Optimization , author=. 2024 , eprint=

2024
[37]

2025 , eprint=

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents , author=. 2025 , eprint=

2025
[38]

2022 , eprint=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. 2022 , eprint=

2022
[39]

2026 , eprint=

Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization , author=. 2026 , eprint=

2026
[40]

arXiv preprint arXiv:2505.02077 , year=

Open challenges in multi-agent security: Towards secure systems of interacting ai agents , author=. arXiv preprint arXiv:2505.02077 , year=

Pith/arXiv arXiv
[41]

2020 , eprint=

AttriGuard: A Practical Defense Against Attribute Inference Attacks via Adversarial Machine Learning , author=. 2020 , eprint=

2020
[42]

2026 , eprint=

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection , author=. 2026 , eprint=

2026
[43]

Semantic Geometry of Sentence Embeddings

Tehenan, Matthieu. Semantic Geometry of Sentence Embeddings. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.641

work page doi:10.18653/v1/2025.findings-emnlp.641 2025
[44]

and Geiger, Atticus and Nanda, Neel

Tigges, Curt and Hollinsworth, Oskar J. and Geiger, Atticus and Nanda, Neel. Language Models Linearly Represent Sentiment. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.5

work page doi:10.18653/v1/2024.blackboxnlp-1.5 2024
[45]

2025 , eprint=

Jailbreak Attack Initializations as Extractors of Compliance Directions , author=. 2025 , eprint=

2025
[46]

2026 , eprint=

Silenced Biases: The Dark Side LLMs Learned to Refuse , author=. 2026 , eprint=

2026
[47]

2025 , eprint=

Silent Tokens, Loud Effects: Padding in LLMs , author=. 2025 , eprint=

2025
[48]

2021 , eprint=

Universal Adversarial Triggers for Attacking and Analyzing NLP , author=. 2021 , eprint=

2021

[1] [1]

A survey on large language model based autonomous agents , volume=

Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Jirong , year=. A survey on large language model based autonomous agents , volume=. Frontiers of Computer Science , publisher=. doi:10.1007/s11704-024-40231-1...

work page doi:10.1007/s11704-024-40231-1

[2] [2]

2023 , eprint=

The Rise and Potential of Large Language Model Based Agents: A Survey , author=. 2023 , eprint=

2023

[3] [3]

2023 , eprint=

Generative Agents: Interactive Simulacra of Human Behavior , author=. 2023 , eprint=

2023

[4] [4]

Representing LLMs in Prompt Semantic Task Space , url=

Kashani, Idan and Mendelson, Avi and Nemcovsky, Yaniv , year=. Representing LLMs in Prompt Semantic Task Space , url=. doi:10.18653/v1/2025.findings-emnlp.456 , booktitle=

work page doi:10.18653/v1/2025.findings-emnlp.456 2025

[5] [5]

2024 , eprint=

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework , author=. 2024 , eprint=

2024

[6] [6]

2023 , eprint=

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation , author=. 2023 , eprint=

2023

[7] [7]

2023 , eprint=

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance , author=. 2023 , eprint=

2023

[8] [8]

2023 , eprint=

Jailbroken: How Does LLM Safety Training Fail? , author=. 2023 , eprint=

2023

[9] [9]

2019 , eprint=

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , author=. 2019 , eprint=

2019

[10] [10]

2023 , eprint=

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs , author=. 2023 , eprint=

2023

[11] [11]

2025 , eprint=

RouteLLM: Learning to Route LLMs with Preference Data , author=. 2025 , eprint=

2025

[12] [12]

2024 , eprint=

EmbedLLM: Learning Compact Representations of Large Language Models , author=. 2024 , eprint=

2024

[13] [13]

2025 , eprint=

MixLLM: Dynamic Routing in Mixed Large Language Models , author=. 2025 , eprint=

2025

[14] [14]

2025 , eprint=

GraphRouter: A Graph-based Router for LLM Selections , author=. 2025 , eprint=

2025

[15] [15]

2023 , eprint=

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face , author=. 2023 , eprint=

2023

[16] [16]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

2021

[17] [17]

2024 , eprint=

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=

2024

[18] [18]

2025 , eprint=

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models , author=. 2025 , eprint=

2025

[19] [19]

2023 , eprint=

A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks , author=. 2023 , eprint=

2023

[20] [20]

2023 , eprint=

Poisoning Language Models During Instruction Tuning , author=. 2023 , eprint=

2023

[21] [21]

2025 , eprint=

Scaling Trends for Data Poisoning in LLMs , author=. 2025 , eprint=

2025

[22] [22]

Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures , url=

Bagdasaryan, Eugene and Shmatikov, Vitaly , year=. Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures , url=. doi:10.1109/sp46214.2022.9833572 , booktitle=

work page doi:10.1109/sp46214.2022.9833572 2022

[23] [23]

2025 , eprint=

Prompt Injection attack against LLM-integrated Applications , author=. 2025 , eprint=

2025

[24] [24]

2024 , eprint=

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents , author=. 2024 , eprint=

2024

[25] [25]

2024 , eprint=

PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models , author=. 2024 , eprint=

2024

[26] [26]

2024 , eprint=

Defending Against Indirect Prompt Injection Attacks With Spotlighting , author=. 2024 , eprint=

2024

[27] [27]

2024 , eprint=

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , author=. 2024 , eprint=

2024

[28] [28]

2024 , eprint=

StruQ: Defending Against Prompt Injection with Structured Queries , author=. 2024 , eprint=

2024

[29] [29]

2023 , eprint=

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion , author=. 2023 , eprint=

2023

[30] [30]

2026 , eprint=

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multi-Step Malware , author=. 2026 , eprint=

2026

[31] [31]

2025 , eprint=

Life-Cycle Routing Vulnerabilities of LLM Router , author=. 2025 , eprint=

2025

[32] [32]

M as R outer: Learning to Route LLM s for Multi-Agent Systems

Yue, Yanwei and Zhang, Guibin and Liu, Boyang and Wan, Guancheng and Wang, Kun and Cheng, Dawei and Qi, Yiyan. M as R outer: Learning to Route LLM s for Multi-Agent Systems. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.757

work page doi:10.18653/v1/2025.acl-long.757 2025

[33] [33]

2025 , eprint=

Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents , author=. 2025 , eprint=

2025

[34] [34]

2025 , eprint=

Environment Scaling for Interactive Agentic Experience Collection: A Survey , author=. 2025 , eprint=

2025

[35] [35]

2025 , eprint=

LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions , author=. 2025 , eprint=

2025

[36] [36]

2024 , eprint=

GEO: Generative Engine Optimization , author=. 2024 , eprint=

2024

[37] [37]

2025 , eprint=

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents , author=. 2025 , eprint=

2025

[38] [38]

2022 , eprint=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. 2022 , eprint=

2022

[39] [39]

2026 , eprint=

Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization , author=. 2026 , eprint=

2026

[40] [40]

arXiv preprint arXiv:2505.02077 , year=

Open challenges in multi-agent security: Towards secure systems of interacting ai agents , author=. arXiv preprint arXiv:2505.02077 , year=

Pith/arXiv arXiv

[41] [41]

2020 , eprint=

AttriGuard: A Practical Defense Against Attribute Inference Attacks via Adversarial Machine Learning , author=. 2020 , eprint=

2020

[42] [42]

2026 , eprint=

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection , author=. 2026 , eprint=

2026

[43] [43]

Semantic Geometry of Sentence Embeddings

Tehenan, Matthieu. Semantic Geometry of Sentence Embeddings. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.641

work page doi:10.18653/v1/2025.findings-emnlp.641 2025

[44] [44]

and Geiger, Atticus and Nanda, Neel

Tigges, Curt and Hollinsworth, Oskar J. and Geiger, Atticus and Nanda, Neel. Language Models Linearly Represent Sentiment. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.5

work page doi:10.18653/v1/2024.blackboxnlp-1.5 2024

[45] [45]

2025 , eprint=

Jailbreak Attack Initializations as Extractors of Compliance Directions , author=. 2025 , eprint=

2025

[46] [46]

2026 , eprint=

Silenced Biases: The Dark Side LLMs Learned to Refuse , author=. 2026 , eprint=

2026

[47] [47]

2025 , eprint=

Silent Tokens, Loud Effects: Padding in LLMs , author=. 2025 , eprint=

2025

[48] [48]

2021 , eprint=

Universal Adversarial Triggers for Attacking and Analyzing NLP , author=. 2021 , eprint=

2021