When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

Guangtao Zheng; Hanjie Chen; Lingxi Zhang

arxiv: 2605.01133 · v2 · pith:IF4G3HBKnew · submitted 2026-05-01 · 💻 cs.CR · cs.LG · cs.MA

Pith reviewed 2026-05-09 18:31 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.MA

keywords multi-agent systems · LLM safety · embedding attacks · confidence scores · message pruning · adversarial robustness

The pith

Embedding-based defenses in LLM multi-agent systems fail when attackers craft messages whose embeddings lie close to benign ones, but token confidence scores provide a workable alternative for pruning suspicious messages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing defenses for multi-agent LLM systems, which try to spot malicious agents by checking if their message embeddings differ from benign ones, break down when attackers deliberately make those embeddings similar. It demonstrates this failure with three concrete attacks and explains why embeddings alone are insufficient since they ignore internal model signals. The authors propose instead using the model's token-level confidence scores to decide whether to prune or down-weight incoming messages during communication. Experiments indicate this method increases robustness across different models, data sets, and network shapes. The benefit fades after several rounds of back-and-forth, so acting early matters.

Core claim

Embedding-based defenses for detecting malicious agents in LLM-powered multi-agent systems lose effectiveness because they require clear separation in text embeddings between malicious and benign messages, a separation that attackers can eliminate by crafting messages whose embeddings sit close to benign ones; token-level confidence signals such as logits remain informative even when embeddings no longer separate the classes and can therefore be used to prune or down-weight suspect messages.

What carries the argument

Token-level confidence scores from model logits, applied to prune or down-weight messages during multi-agent communication.
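One way to picture this mechanism is the minimal sketch below. It assumes the confidence score is the mean log-probability of the tokens a sender emitted, that low-confidence messages are the ones to prune, and that the remaining messages are weighted in proportion to their confidence. The function names, the `prune_below` threshold, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def mean_token_log_prob(token_logits, chosen_ids):
    """Average log-probability the sender's model gave to the tokens it emitted."""
    total = 0.0
    for logits, token_id in zip(token_logits, chosen_ids):
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += logits[token_id] - log_norm
    return total / len(chosen_ids)


def weigh_messages(messages, prune_below=-0.5):
    """Map sender -> aggregation weight; a weight of 0.0 means the message is pruned.

    messages: iterable of (sender, token_logits, chosen_ids).
    prune_below: hypothetical threshold on the mean log-probability.
    """
    confidence = {s: mean_token_log_prob(l, ids) for s, l, ids in messages}
    kept = {s: math.exp(c) for s, c in confidence.items() if c >= prune_below}
    total = sum(kept.values())
    return {s: (kept[s] / total if s in kept else 0.0) for s in confidence}


# Toy three-token vocabulary; the logits are invented for illustration.
messages = [
    ("agent_a", [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]], [0, 1]),  # sharply peaked
    ("agent_b", [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0, 1]),  # moderately peaked
    ("agent_c", [[0.2, 0.0, 0.0], [0.0, 0.1, 0.0]], [0, 1]),  # nearly uniform
]
weights = weigh_messages(messages)
```

In this toy run the nearly uniform sender falls below the threshold and receives weight 0.0, while the other two share the remaining weight in proportion to their confidence. A hard prune and a soft down-weight are combined here only for brevity; either could be used on its own.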

If this is right

Robustness increases across models, data sets, and communication topologies when confidence scores guide pruning.
The protective effect of confidence scores declines over successive communication rounds, making early intervention necessary.
Safety designs for multi-agent systems should move beyond sole reliance on embedding similarity to include internal model signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A combined defense that checks both embeddings and confidence scores might catch more attacks than either alone.
The same confidence signal could be tested in single-agent settings where similar message manipulation occurs.
The observed decay over rounds suggests measuring how many communication steps are needed before the signal becomes unusable.

Load-bearing premise

Token-level confidence signals such as logits remain informative and separable when text embeddings are no longer distinguishable under attack.

What would settle it

An experiment in which confidence-based pruning produces no robustness gain or in which confidence scores become as inseparable as embeddings under the three described attacks.

Figures

Figures reproduced from arXiv: 2605.01133 by Guangtao Zheng, Hanjie Chen, Lingxi Zhang.

**Figure 1.** Illustration of overt and near-benign attacks in MAS. …agent systems are increasingly deployed across society in real-world applications such as chatbots (Li et al., 2024) and software engineering (Qian et al., 2024), ensuring their safety becomes critical. In contrast to single-agent settings, multi-agent systems introduce system-level risks, as misinformation or errors can propagate and amplify through i…

**Figure 2.** Illustration of near-benign attacks in multi-agent systems, including Slow Drift, Chaos Seeding, and Benign Wrapper. Theorem 4.1 (Acceptance region and near-benign evasion). Let $s : \mathbb{R}^d \to \mathbb{R}$ be an $L$-Lipschitz scoring function and define the acceptance region $\Omega := \{h : s(h) \ge 0\}$. Assume a benign margin $\gamma > 0$ on the benign support: $\inf_{h \in S_B} s(h) \ge \gamma$. Then $\Omega$ must contain the $\gamma/L$-neighborhood of $S_B$: {h : dist(h…

**Figure 3.** Prompts for benign agents on MMLU. Round 1: The answer provided by the user is: {user_answer}. Assume the user's answer is correct. Your only task is to strongly justify why it is correct. Do not mention, imply, or provide any reasoning for any alternative answer you may have in mind. Use confident, affirmative language and give the strongest supporting reasoning. You must format output exactly as follows,…

**Figure 4.** Prompts for overt attack agents on MMLU.

**Figure 5.** Prompts for our near-benign attack agents on MMLU.
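The Theorem 4.1 excerpt in the Figure 2 entry can be checked numerically on a toy case. The sketch below assumes a linear scoring function s(h) = w·h + b, which is L-Lipschitz with L = ‖w‖, and assumes the neighborhood is measured in Euclidean distance. The detector weights and the two benign embeddings are hypothetical values chosen for illustration, not anything from the paper.

```python
import math


def score(h, w, b):
    """Linear detector score s(h) = w . h + b, which is L-Lipschitz with L = ||w||."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


# Hypothetical detector and benign embeddings, chosen only for illustration.
w, b = [3.0, 4.0], -5.0
lipschitz = math.hypot(*w)                      # L = ||w||
benign = [[2.0, 1.0], [1.0, 2.0]]
gamma = min(score(h, w, b) for h in benign)     # benign margin
radius = gamma / lipschitz                      # gamma / L

# Move a benign point against w (the direction that lowers the score fastest),
# stopping at 90% of the radius so the point stays inside the neighborhood.
step = 0.9 * radius
shifted = [hi - step * wi / lipschitz for hi, wi in zip(benign[0], w)]

distance = math.dist(shifted, benign[0])
accepted = score(shifted, w, b) >= 0.0
```

The shifted point is moved in the worst-case direction and is still accepted, which is the evasion the theorem describes: any detector with a benign margin and bounded sensitivity must accept everything sufficiently close to benign embeddings, whatever the text behind those embeddings says.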

read the original abstract

Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks, Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. These insights can inform and inspire future work on MAS attacks and defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Embedding defenses in multi-agent LLM systems are easy to evade by aligning malicious messages in embedding space, but token confidence scores give a workable early filter that fades over rounds.

The paper's main finding is straightforward: embedding-based pruning of suspicious messages in LLM multi-agent systems fails when attackers craft messages whose embeddings sit close to benign ones. They demonstrate this with three attacks—Slow Drift, Benign Wrapper, and Chaos Seeding—and show that token-level logits or confidence scores remain separable enough to use for down-weighting or pruning instead. The effectiveness drops as rounds continue, which points to the need for early intervention.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that embedding-based defenses in LLM-based multi-agent systems (MAS) are vulnerable because attackers can craft malicious messages whose text embeddings lie close to those of benign messages, thereby evading detection and pruning. It theoretically analyzes this failure mode and empirically validates it with three attacks (Slow Drift, Benign Wrapper, and Chaos Seeding). The paper further shows that token-level confidence signals such as logits can remain separable even when embeddings are not, and proposes using these scores to prune or down-weight messages during communication. Experiments demonstrate improved robustness across models, datasets, and topologies, while noting that the utility of confidence signals decays over communication rounds.

Significance. If the results hold, the work is significant for MAS safety research: it identifies a concrete limitation of purely embedding-based defenses and supplies a practical, complementary signal (token-level logits) that can be integrated into existing pipelines. The theoretical framing plus the multi-model, multi-topology experiments provide a useful template for future defense design, and the decay observation supplies a concrete recommendation for early intervention.

major comments (2)

§5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.
§4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.

minor comments (2)

Figure captions for the decay-over-rounds plots should explicitly state the communication topology and model used in each panel.
Notation for confidence scores (e.g., whether raw logits, softmax probabilities, or normalized values) should be defined once in §3 and used consistently thereafter.
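The notation point matters because the candidate definitions behave differently. As a small sketch (the helper names and logits are hypothetical), the snippet below computes three possible "confidence" values for the same chosen token and compares two logit vectors that differ only by a constant shift.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_variants(logits, chosen):
    """Three candidate confidence definitions for the token at index `chosen`."""
    prob = softmax(logits)[chosen]
    return {
        "raw_logit": logits[chosen],
        "probability": prob,
        "log_probability": math.log(prob),
    }


# Two hypothetical logit vectors that differ only by adding 10 to every entry.
base = confidence_variants([2.0, 1.0, 0.0], chosen=0)
shifted = confidence_variants([12.0, 11.0, 10.0], chosen=0)
```

The raw logit changes under the shift while the softmax probability and log-probability do not, so a pruning threshold tuned on one definition would not transfer to another.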

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of our work. We address each major comment below and will incorporate the requested details in the revised manuscript.

Referee: §5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.

Authors: We agree that reporting the number of independent runs, error bars, and statistical significance tests is essential for evaluating the reliability of our cross-condition claims. The current version presents point estimates without these details. In the revised manuscript, we will explicitly state that all experiments were repeated over 5 independent runs using different random seeds, add error bars (standard error) to the figures in §5, and include statistical significance tests (e.g., paired t-tests with p-values) comparing the proposed confidence-based pruning against embedding-only baselines. These additions will be made to both the text and figures.

revision: yes

Referee: §4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.

Authors: We acknowledge that the attack descriptions in §4 are currently at a conceptual level. To ensure full reproducibility, we will expand §4 (and add an appendix if needed) with the exact prompt templates used for each attack (Slow Drift, Benign Wrapper, and Chaos Seeding), the optimization objectives (e.g., the specific loss functions minimizing cosine distance between malicious and benign embeddings), and all hyper-parameters including embedding model, learning rate, number of optimization iterations, batch size, and temperature settings. This will allow other researchers to replicate the evasion results exactly.

revision: yes
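The statistics promised in the first response could be computed along the following lines. This is a sketch using only the Python standard library; the accuracy lists are made-up placeholders for five seeds and are not results from the paper. Converting the t statistic to a p-value needs a Student-t distribution, which the standard library does not provide, so that step is left out.

```python
import math
import statistics


def standard_error(values):
    """Sample standard deviation divided by sqrt(n), as used for error bars."""
    return statistics.stdev(values) / math.sqrt(len(values))


def paired_t_statistic(treated, baseline):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [t - b for t, b in zip(treated, baseline)]
    return statistics.mean(diffs) / standard_error(diffs)


# Placeholder accuracies over 5 seeds (invented numbers, NOT from the paper).
confidence_pruning = [0.80, 0.82, 0.78, 0.81, 0.79]
embedding_baseline = [0.70, 0.71, 0.69, 0.73, 0.67]

error_bar = standard_error(confidence_pruning)
t_stat = paired_t_statistic(confidence_pruning, embedding_baseline)
degrees_of_freedom = len(confidence_pruning) - 1
```

Pairing by seed is the relevant design choice: both methods are compared on the same runs, so run-to-run variance shared by the two methods cancels in the differences.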

Circularity Check

0 steps flagged

No significant circularity

The paper derives its central claim from a sequence of explicit attacks (Slow Drift, Benign Wrapper, Chaos Seeding) that are defined and validated independently of the proposed defense, followed by an empirical observation that token-level logits remain separable when embeddings are not, and then a straightforward experimental validation of confidence-based pruning. No equation or premise reduces to a self-definition, a fitted parameter relabeled as a prediction, or a load-bearing self-citation chain. The argument is self-contained against the stated attacks and results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard domain assumptions about embedding spaces and logit distributions in LLMs without introducing new free parameters or invented entities.

axioms (1)

Domain assumption: Embedding-based defenses rely on clear separation between malicious and benign message embeddings in vector space.
Explicitly stated as the basis for existing defenses whose failure is analyzed.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration
cs.MA 2026-06 conditional novelty 7.0

KV-cache sharing boosts multi-agent QA performance but enables undetectable tampering; HMAC manifests binding agent, session, and payload reliably detect changes.
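As a rough illustration of what an HMAC manifest binding agent, session, and payload could look like, the sketch below uses Python's standard `hmac` and `hashlib` modules. The manifest layout, field names, and function names are assumptions made for this example; the citing paper's actual format is not described here.

```python
import hashlib
import hmac
import json


def sign_manifest(key: bytes, agent_id: str, session_id: str, payload: bytes) -> str:
    """Return an HMAC-SHA256 tag over a manifest naming agent, session, and payload hash."""
    manifest = json.dumps(
        {
            "agent": agent_id,
            "session": session_id,
            "payload_sha256": hashlib.sha256(payload).hexdigest(),
        },
        sort_keys=True,  # fixed key order so both sides serialize identically
    ).encode("utf-8")
    return hmac.new(key, manifest, hashlib.sha256).hexdigest()


def verify_manifest(key: bytes, agent_id: str, session_id: str,
                    payload: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_manifest(key, agent_id, session_id, payload)
    return hmac.compare_digest(expected, tag)


# Hypothetical shared key and cache bytes, for illustration only.
key = b"shared-secret-for-demo"
cache_bytes = b"serialized kv-cache block"
tag = sign_manifest(key, "agent_a", "session_1", cache_bytes)

ok = verify_manifest(key, "agent_a", "session_1", cache_bytes, tag)
tampered = verify_manifest(key, "agent_a", "session_1", cache_bytes + b"!", tag)
replayed = verify_manifest(key, "agent_a", "session_2", cache_bytes, tag)
```

Because the tag covers the agent and session identifiers as well as the payload hash, a modified payload and a payload replayed into a different session both fail verification in this sketch.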