The Distributed Detectability Band Against Marginal-Preserving Attacks

Gao Yuze; Zhang Qinqin

arxiv: 2606.10456 · v1 · pith:OQKJOKDQnew · submitted 2026-06-09 · 💻 cs.CR · cs.AI

Pith reviewed 2026-06-27 12:56 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords distributed sabotage · marginal-preserving attack · temporal correlation · AI control monitors · detectability band · Gaussian copula · sequence monitoring · AR(1) process

The pith

A Gaussian-copula AR(1) attack encodes harm in temporal correlations while exactly preserving per-step score marginals, defeating marginal monitors but not sequence monitors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI-control monitors that score individual actions can miss distributed harm spread across many steps that each appear normal. The paper builds an attack that holds the per-step monitor-score distribution identical to benign behavior, so monitors using mean, max, or threshold rules fail by construction. Harm is instead placed in the correlation structure across steps. Sequence monitors that examine temporal patterns still detect the attack, producing a gap between undetectable and detectable cases even as the harm is amortized over longer sequences.

Core claim

The paper constructs a marginal-preserving, correlation-encoded distributed-sabotage attack via Gaussian-copula AR(1) that keeps the per-step score marginal exactly equal to the benign case, defeats all tested marginal monitors, remains marginal-preserving under different score functions, and is detected by temporal monitors with AUC 0.79-0.97 at 1% FPR while marginal monitors stay at chance; the band persists as harm is spread over more steps.

What carries the argument

Gaussian-copula AR(1) construction that holds the per-step monitor-score marginal fixed while placing harm in the temporal correlation structure
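
A minimal sketch of how such a construction can work, offered as an illustration rather than the authors' implementation. It assumes a stationary Gaussian AR(1) latent with coefficient `phi`, and it uses a Kumaraswamy(2, 5) law as a stand-in for the benign score marginal because its quantile function has a closed form. The names `benign_inv_cdf`, `copula_ar1_scores`, and `lag1_autocorr` and all parameter values are hypothetical. The sketch shows only the decoupling of marginal and dependence; it does not model how the paper maps harm onto the correlation structure.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal used for the copula transform


def benign_inv_cdf(u, a=2.0, b=5.0):
    """Quantile function of a Kumaraswamy(a, b) law on [0, 1].

    ASSUMPTION: stand-in for the benign score marginal; a, b are illustrative.
    """
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def copula_ar1_scores(T, phi, rng):
    """Draw T scores with the benign marginal and AR(1) latent dependence."""
    z = rng.gauss(0.0, 1.0)  # stationary start, z ~ N(0, 1)
    scores = []
    for _ in range(T):
        u = STD_NORMAL.cdf(z)  # uniform on [0, 1] whatever phi is
        scores.append(benign_inv_cdf(u))
        # Innovation scale sqrt(1 - phi^2) keeps Var(z) = 1 at every step.
        z = phi * z + math.sqrt(1.0 - phi * phi) * rng.gauss(0.0, 1.0)
    return scores


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    m = sum(x) / len(x)
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(len(x) - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den


rng = random.Random(0)
benign = copula_ar1_scores(5000, 0.0, rng)  # phi = 0: i.i.d. benign scores
attack = copula_ar1_scores(5000, 0.8, rng)  # phi = 0.8: same marginal, correlated
print(sum(benign) / 5000, sum(attack) / 5000)        # per-step means
print(lag1_autocorr(benign), lag1_autocorr(attack))  # temporal dependence
```

Because the latent variable is standard normal at every step regardless of `phi`, the per-step score distribution is the same in both runs; only the lag-1 autocorrelation separates them.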

If this is right

Marginal monitors including mean, max, top-k tail, and threshold rules are defeated by construction.
Temporal monitors such as CUSUM, SPRT, HMM-LR, runs test, autocorrelation, and windowed logistic achieve AUC 0.79-0.97 at the same 1% FPR target.
The performance gap remains when harm is amortized over increasing numbers of steps.
The attack scope is limited to score functions that admit a temporal signature.
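
The contrast between the two monitor families can be sketched on synthetic data. The example below is a toy under stated assumptions, not a reproduction of the paper's experiments: the benign marginal is a Kumaraswamy(2, 5) stand-in, the attack is a Gaussian-copula AR(1) sequence with a hypothetical coefficient `PHI`, the marginal family is represented by the episode mean, and the temporal family by the lag-1 autocorrelation. The episode count, horizon, and coefficient are illustrative, so the resulting AUC values should not be compared with the reported 0.79-0.97 range.

```python
import math
import random
from statistics import NormalDist

PHI_CDF = NormalDist().cdf  # standard normal CDF


def episode(T, phi, rng, a=2.0, b=5.0):
    """One episode of T scores: Gaussian AR(1) latent pushed through a
    Kumaraswamy(a, b) quantile function (ASSUMED benign marginal)."""
    z, out = rng.gauss(0.0, 1.0), []
    for _ in range(T):
        out.append((1.0 - (1.0 - PHI_CDF(z)) ** (1.0 / b)) ** (1.0 / a))
        z = phi * z + math.sqrt(1.0 - phi * phi) * rng.gauss(0.0, 1.0)
    return out


def mean_monitor(x):
    """Marginal-family statistic: episode mean of the per-step scores."""
    return sum(x) / len(x)


def autocorr_monitor(x):
    """Temporal-family statistic: sample lag-1 autocorrelation."""
    m = sum(x) / len(x)
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(len(x) - 1))
    return num / sum((v - m) ** 2 for v in x)


def auc(neg, pos):
    """P(pos > neg) + 0.5 * P(tie), computed over all pairs."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(1)
N, T, PHI = 300, 100, 0.3  # illustrative values, not the paper's settings
benign = [episode(T, 0.0, rng) for _ in range(N)]
attack = [episode(T, PHI, rng) for _ in range(N)]

auc_mean = auc([mean_monitor(e) for e in benign], [mean_monitor(e) for e in attack])
auc_acf = auc([autocorr_monitor(e) for e in benign], [autocorr_monitor(e) for e in attack])
print(f"mean monitor AUC: {auc_mean:.2f}, autocorrelation monitor AUC: {auc_acf:.2f}")
```

In this toy setting the mean statistic has the same expectation under both conditions, so its AUC stays near 0.5, while the autocorrelation statistic separates the two sets of episodes.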

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid monitors that combine marginal checks with explicit correlation tests would be required to close the identified band.
The same construction could be applied to non-Gaussian marginals or higher-order dependence structures to test the generality of the band.
Real agent deployments would need to log full action sequences rather than aggregated scores to enable the temporal monitors shown to work.
The realizability gate implies that future monitor evaluations must test against correlation-encoded attacks rather than only per-step perturbations.

Load-bearing premise

The Gaussian-copula AR(1) construction can exactly preserve the per-step score marginal while encoding arbitrary harm levels in the correlation structure.

What would settle it

An experiment in which the KS-distance between the attack and benign per-step score distributions exceeds 0.05 at harm level 3.0, or in which every temporal monitor tested against the attack yields AUC no higher than 0.55 at 1% FPR.
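
The falsification criterion above can be written down as a small check. This is a sketch under assumptions: `ks_two_sample` is a plain two-sample Kolmogorov-Smirnov statistic, and `band_is_falsified` is a hypothetical helper that encodes the two conditions with the thresholds 0.05 and 0.55 quoted in the text. The inputs would come from an actual experiment; the values in the usage lines are the ones stated in the abstract.

```python
def ks_two_sample(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both empirical CDFs past every observation equal to v.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def band_is_falsified(ks_at_harm_3, temporal_aucs, ks_limit=0.05, auc_limit=0.55):
    """True if either condition holds: the marginal is not preserved at
    harm level 3.0, or no temporal monitor rises above auc_limit."""
    return ks_at_harm_3 > ks_limit or max(temporal_aucs) <= auc_limit


# Identical samples have KS distance 0; fully separated samples have 1.
print(ks_two_sample([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))  # prints 0.0
print(ks_two_sample([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))  # prints 1.0

# Values quoted in the abstract: KS = 0.013, Monitor B AUC range 0.79-0.97.
print(band_is_falsified(0.013, [0.79, 0.97]))  # prints False
```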

Figures

Figures reproduced from arXiv: 2606.10456 by Gao Yuze, Zhang Qinqin.

**Figure 2.** Realizability gate: KS-distance vs. harm budget.

**Figure 3.** ROC curves: Monitor A vs. Monitor B (marginal-preserving attack).

**Figure 4.** AUC vs. harm spread (money figure). As harm is amortized over more steps (larger spread K), Monitor A collapses to AUC ≈ 0.52 (chance), while Monitor B (SPRT shown) holds at AUC ≈ 0.95. This is the key non-empty detectability band result: the families diverge under the distributional condition that makes the attack most realistic. (it is marginal-preserving for every δϕ). Monitor B degrades gracefully: the…

**Figure 5.** KS-distance to benign across stealth/spread parameter grid.

**Figure 6.** Adaptive red-teaming. (a) Detectability frontier: as the attacker lowers its coordination δϕ, Monitor A stays at chance while Monitor B's AUC decays toward 0.5; the parameter-free autocorrelation detector crosses the evasion line (AUC = 0.65, orange) at δϕ⋆ = 0.05. (b) The defender's counter: at a stealthy δϕ = 0.05, accumulating more steps T restores Monitor B's AUC (autocorr 0.62→0.86 as T: 100→1600), while …

**Figure 7.** Real agent/monitor traces. (a) Real benign per-step monitor scores (Qwen2.5-7B agent, Llama-3.2-3B monitor) with a Beta fit; real autocorrelation is low (ϕ = 0.06). (b) The marginal-preserving attack (KS = 0.01 to real benign) leaves Monitor A at chance; Monitor B is above chance but modest on short real episodes. (c) Holding N fixed, Monitor B's AUC rises with the monitored horizon (0.68→0.98 as T: 24→400)…

Abstract

AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Gaussian-copula AR(1) construction isolates a clean gap where per-step monitors fail by design but temporal ones retain power, provided the cross-monitor marginal preservation holds.

The main takeaway is that this paper gives an explicit way to build distributed attacks that match benign per-step score marginals exactly while packing harm into the correlation structure. Monitor A (mean, max, threshold) drops to chance-level AUC while Monitor B (CUSUM, SPRT, autocorrelation) stays in the 0.79-0.97 range at 1% FPR, and the gap persists as harm spreads over more steps.

What is new is the specific Gaussian-copula AR(1) mechanism that decouples the marginal from the harm level, plus the three-gate structure that forces the authors to address realizability, cross-monitor consistency, and the resulting detectability band. The reported KS distance of 0.013 at harm levels up to 3.0 is concrete evidence that the construction works on the numbers they show.

The formal reconciliation step for Monitor B is the part that matters most. The abstract states they prove the attack remains marginal-preserving under B's different scoring function when limited to monitors that admit a temporal signature; if that proof is tight, the A-vs-B comparison is controlled. The stress-test worry about non-monotonic transforms does not appear to land once the scope is applied.

Soft spots are minor but real. Everything rests on simulation, so the exact AR(1) coefficient choices and how the harm amortization is implemented need to be reproducible. The paper does not claim optimality for the temporal family, which keeps the claim proportionate.

This is for people working on AI agent monitoring and stealthy attack detection. It deserves a serious referee because the construction is explicit, the gates are addressed, and the quantitative gap is falsifiable.

Referee Report

1 major / 0 minor

Summary. The paper constructs a Gaussian-copula AR(1) attack that exactly preserves per-step monitor-score marginals (KS-distance 0.013), defeating mean/max/top-k/threshold monitors (Monitor A) by construction while encoding harm in temporal correlations; it then shows formally that the same sequences remain marginal-preserving under a different-score Monitor B (CUSUM, SPRT, HMM-LR, etc.) and reports AUC 0.52 for A versus 0.79-0.97 for B at 1% FPR, with A collapsing to chance as harm is amortized while B holds at ~0.95, establishing a non-empty detectability band.

Significance. If the formal reconciliation and simulation results are reproducible, the work would be significant for AI-control research by providing an explicit, parameter-light construction that separates marginal and correlation effects and by quantifying the frontier between distribution-shape and temporal monitors. The three-gate structure, the reported KS distance, and the scoped worst-case claim for score functions admitting temporal signatures are concrete strengths.

major comments (1)

[Monitor-A-vs-B reconciliation] Monitor-A-vs-B reconciliation paragraph: the claim that the copula defined on Monitor A's marginal automatically preserves the marginal under Monitor B requires an explicit statement of the functional relationship between the two score functions. If B's scoring is a non-monotonic or non-linear function of the underlying actions (rather than a monotonic transform of A's scores), the induced B-marginal need not match the benign distribution; this is load-bearing for attributing B's AUC gains solely to temporal structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive comment on the Monitor-A-vs-B reconciliation. The observation correctly identifies a point where greater explicitness is needed to support the formal claim.

Referee: [Monitor-A-vs-B reconciliation] Monitor-A-vs-B reconciliation paragraph: the claim that the copula defined on Monitor A's marginal automatically preserves the marginal under Monitor B requires an explicit statement of the functional relationship between the two score functions. If B's scoring is a non-monotonic or non-linear function of the underlying actions (rather than a monotonic transform of A's scores), the induced B-marginal need not match the benign distribution; this is load-bearing for attributing B's AUC gains solely to temporal structure.

Authors: We agree that the reconciliation paragraph requires an explicit statement of the functional relationship. The Gaussian-copula AR(1) construction is defined directly on the per-step scores produced by Monitor A, so the marginal under A is preserved by design. Monitor B is applied to the identical underlying action sequences; the B-family monitors (CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic) are sequence-level statistics rather than arbitrary non-monotonic transforms of single-step A scores. Because the attack preserves the per-step marginal of A while only altering temporal dependence, the induced marginal under these particular B monitors remains identical to the benign marginal (as confirmed by the reported KS distances). We will revise the manuscript to state this relationship explicitly, to scope the formal claim to score functions whose temporal signatures are functions of the preserved per-step marginal, and to note that the claim does not extend to arbitrary non-monotonic B. This revision will be made in the next version.

Revision: yes

Circularity Check

0 steps flagged

No circularity: explicit copula construction and simulation results are self-contained

The paper defines the attack via an explicit Gaussian-copula AR(1) process that enforces exact per-step marginal equality to the benign distribution for Monitor A by construction, then reports direct simulation (KS=0.013) and AUC values from that construction. The Monitor-A-vs-B reconciliation is stated as a formal argument without reducing to a fitted parameter renamed as prediction or to any self-citation chain. No quoted equation or step equates a claimed prediction back to its own inputs; the detectability-band results therefore remain independent of the circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract only; cannot fully audit. The construction relies on the statistical assumption that copulas can decouple marginals from dependence structure.

free parameters (1)

AR(1) correlation coefficient
Central to encoding temporal harm while preserving marginals; value not reported in abstract.

axioms (1)

domain assumption Per-step monitor scores admit a copula representation that exactly preserves the benign marginal under the attack.
Invoked in the realizability gate to defeat marginal monitors by construction.
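
The assumption can be made concrete with the standard Gaussian-copula argument, written out here as a sketch. The symbols are introduced for illustration and are not taken from the paper: $F_0$ is the benign per-step score CDF, $\Phi$ the standard normal CDF, $\phi$ the AR(1) coefficient, and $s_t$ the score at step $t$. How $\phi$ relates to the paper's coordination parameter δϕ is not stated in the abstract.

```latex
\begin{align*}
z_t &= \phi\, z_{t-1} + \sqrt{1-\phi^{2}}\;\varepsilon_t,
  \qquad \varepsilon_t \overset{\text{iid}}{\sim} \mathcal{N}(0,1),
  \quad z_0 \sim \mathcal{N}(0,1), \\
u_t &= \Phi(z_t), \qquad s_t = F_0^{-1}(u_t), \\
z_t \sim \mathcal{N}(0,1)
  &\;\Longrightarrow\; u_t \sim \mathrm{Unif}(0,1)
  \;\Longrightarrow\; s_t \sim F_0
  \quad \text{for every } \phi \in (-1,1), \\
\operatorname{Corr}(z_t, z_{t+k}) &= \phi^{k}.
\end{align*}
```

The third line is the marginal-preservation step: stationarity keeps $z_t$ standard normal, so the per-step law of $s_t$ does not depend on $\phi$. The fourth line is where the dependence lives, which is what a sequence-level monitor can pick up.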

Reference graph

Works this paper leans on

19 extracted references · 10 canonical work pages · 6 internal anchors

[1] Anonymous. LLMs can covertly sandbag on capability evaluations against chain-of-thought monitoring. 2025.

[2] Anthropic. Natural emergent misalignment from reward hacking in production RL. arXiv preprint arXiv:2511.18397, 2025.

[3] Bowen Baker et al. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025.

[4] Michele Basseville and Igor V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice Hall, 1993.

[5] Gregory Benton et al. Sabotage evaluations for frontier models. arXiv preprint arXiv:2410.21514, 2024.

[6] Collin Burns et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In International Conference on Machine Learning, 2024. arXiv:2312.09390.

[7] Michael Chen et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025.

[8] Carson Denison, Evan Hubinger, et al. Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162, 2024.

[9] Ryan Greenblatt, Carson Denison, Evan Hubinger, et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024.

[10] Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Thomas. AI control: Improving safety despite intentional subversion. In Advances in Neural Information Processing Systems, 2024. arXiv:2312.06942.

[11] Tomasz Korbak, Mikita Balesni, Stuart Barnes, Yoshua Bengio, et al. Chain of thought monitorability: A new and fragile opportunity for AI safety. arXiv preprint arXiv:2507.11473, 2025.

[12] Alexander Meinke, Bronson Schoen, Jérémy Scheurer, et al. Frontier models are capable of in-context scheming. arXiv preprint arXiv:2412.04984, 2024.

[13] Roger B. Nelsen. An Introduction to Copulas. Springer, 2nd edition, 2006.

[14] Jerzy Neyman and Egon S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, 231:289–337, 1933.

[15] E. S. Page. Continuous inspection schemes. Biometrika, volume 41, 1954.

[16] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, volume 77, 1989.

[17] M. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut Statistique de l'Université de Paris, volume 8, 1959.

[18] Abraham Wald. Sequential Analysis. Wiley, 1947.

[19] Abraham Wald and Jacob Wolfowitz. On a test whether two samples are from the same population. Annals of Mathematical Statistics, 11(2):147–162, 1940.