Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level, evaluated across four open-weight models (Qwen3-32B, Llama-3.1-70B, DeepSeek-R1 32B, GPT-OSS-20B) and six probe architectures. We frame this as a distributed anomaly detection problem, identifying three collusion signatures that map onto distinct anomaly types and detection paradigms. Every model reaches 1.00 AUROC in-distribution; on our strongest model (Llama-3.1-70B), our five probing techniques achieve 0.73 to 0.93 AUROC when transferred zero-shot to structurally different multi-agent scenarios and 0.99 to 1.00 on a steganographic blackjack card-counting task, with detection performance scaling with model capability. We find that no single probing technique dominates across all collusion types, consistent with the framework's prediction that different anomaly types require different detection paradigms. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion. Code and data available at https://github.com/aaronrose227/narcbench.
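The abstract describes aggregating per-agent deception scores into a group-level decision and reporting AUROC. The sketch below illustrates that general idea on synthetic data only. It is not the NARCBench code, and the abstract does not name the five probing techniques, so the aggregation rules here (`max`, `mean`, `top2_mean`) are hypothetical placeholders. The probe direction, array shapes, and all function names are likewise assumptions made for illustration.

```python
import numpy as np


def agent_scores(activations, probe_w, probe_b=0.0):
    """Score each agent with a linear probe: sigmoid(w . x + b).

    activations: array of shape (..., dim); probe_w: array of shape (dim,).
    The probe is assumed to be already trained (hypothetical here).
    """
    logits = activations @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logits))


def aggregate(scores, how):
    """Collapse one scenario's per-agent scores into a single group score."""
    scores = np.asarray(scores, dtype=float)
    if how == "max":
        return float(scores.max())
    if how == "mean":
        return float(scores.mean())
    if how == "top2_mean":
        # Collusion needs at least two participants, so average the two highest.
        return float(np.sort(scores)[-2:].mean())
    raise ValueError(f"unknown aggregation: {how}")


def auroc(labels, values):
    """AUROC as the chance a positive outscores a negative (ties count half)."""
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)
    pos, neg = values[labels == 1], values[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


if __name__ == "__main__":
    # Synthetic stand-in for activations: 200 scenarios, 4 agents, 16 dims.
    rng = np.random.default_rng(0)
    n_scenarios, n_agents, dim = 200, 4, 16
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)

    labels = rng.integers(0, 2, size=n_scenarios)
    acts = rng.normal(size=(n_scenarios, n_agents, dim))
    # In "colluding" scenarios, shift the first two agents along the direction.
    acts[labels == 1, :2] += 2.0 * direction

    per_agent = agent_scores(acts, direction)  # shape (n_scenarios, n_agents)
    for how in ("max", "mean", "top2_mean"):
        group = np.array([aggregate(s, how) for s in per_agent])
        print(how, round(auroc(labels, group), 3))
```

Which aggregation separates the classes best depends on how the signal is spread across agents: `max` reacts to a single outlier, while `mean` favors shifts shared by many agents. This is in the spirit of the abstract's observation that no single technique dominates across collusion types.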

fields

cs.GT 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

A Note on the Strategic Confinement Problem

cs.GT · 2026-06-07 · unverdicted · novelty 3.0

Strategic agents can achieve high-harm outcomes via low-capacity channels by concentrating residual capacity on high-impact predicates of confidential data, so leakage bounds need not bound worst-case harm.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • A Note on the Strategic Confinement Problem cs.GT · 2026-06-07 · unverdicted · none · ref 53 · internal anchor
