arxiv: 2606.10456 · v1 · pith:OQKJOKDQnew · submitted 2026-06-09 · 💻 cs.CR · cs.AI

The Distributed Detectability Band Against Marginal-Preserving Attacks

Pith reviewed 2026-06-27 12:56 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords distributed sabotage · marginal-preserving attack · temporal correlation · AI control monitors · detectability band · Gaussian copula · sequence monitoring · AR(1) process

The pith

A Gaussian-copula AR(1) attack encodes harm in temporal correlations while exactly preserving per-step score marginals, defeating marginal monitors but not sequence monitors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI-control monitors that score individual actions can miss distributed harm spread across many steps that each appear normal. The paper builds an attack that holds the per-step monitor-score distribution identical to benign behavior, so monitors using mean, max, or threshold rules fail by construction. Harm is instead placed in the correlation structure across steps. Sequence monitors that examine temporal patterns still detect the attack, producing a gap between undetectable and detectable cases even as the harm is amortized over longer sequences.

Core claim

The paper constructs a marginal-preserving, correlation-encoded distributed-sabotage attack from a Gaussian-copula AR(1) process. The attack keeps the per-step score marginal exactly equal to the benign case, defeats all tested marginal monitors, and remains marginal-preserving under different score functions. Temporal monitors detect it with AUC 0.79-0.97 at a 1% FPR target while marginal monitors stay at chance, and the band persists as harm is spread over more steps.

What carries the argument

Gaussian-copula AR(1) construction that holds the per-step monitor-score marginal fixed while placing harm in the temporal correlation structure
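
As a rough illustration of this construction, the sketch below samples a latent Gaussian AR(1) sequence, maps each value through the standard normal CDF, and then through the inverse CDF of a target marginal. The Kumaraswamy marginal, the parameter values, and every function name are assumptions made for the example; the paper's actual benign score distribution and attack parameters are not stated on this page.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def kumaraswamy_quantile(u, a=2.0, b=5.0):
    """Inverse CDF of Kumaraswamy(a, b) on [0, 1]; a hypothetical benign marginal."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def copula_ar1_scores(T, phi, rng, quantile=kumaraswamy_quantile):
    """Draw T scores with a fixed marginal and latent lag-1 correlation phi."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    innovation_sd = math.sqrt(1.0 - phi * phi)  # keeps Var(z_t) = 1 at every step
    z = rng.gauss(0.0, 1.0)                     # stationary start
    scores = []
    for _ in range(T):
        u = STD_NORMAL.cdf(z)                   # uniform on (0, 1) because z ~ N(0, 1)
        scores.append(quantile(u))              # impose the target marginal
        z = phi * z + innovation_sd * rng.gauss(0.0, 1.0)
    return scores


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    m = fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


rng = random.Random(0)
benign = copula_ar1_scores(2000, phi=0.0, rng=rng)  # independent steps
attack = copula_ar1_scores(2000, phi=0.6, rng=rng)  # same marginal, correlated steps

print("means:", round(fmean(benign), 3), round(fmean(attack), 3))
print("lag-1 autocorrelation:", round(lag1_autocorr(benign), 3), round(lag1_autocorr(attack), 3))
```

In this toy setting the two sequences should have nearly the same per-step summary statistics while differing clearly in lag-1 autocorrelation, which is the separation the construction relies on.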

If this is right

  • Marginal monitors including mean, max, top-k tail, and threshold rules are defeated by construction.
  • Temporal monitors such as CUSUM, SPRT, HMM-LR, runs test, autocorrelation, and windowed logistic achieve AUC 0.79-0.97 at the same 1% FPR target.
  • The performance gap remains when harm is amortized over increasing numbers of steps.
  • The attack scope is limited to score functions that admit a temporal signature.
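
The contrast between the two monitor families can be sketched in a small simulation. The example below is an assumption-laden toy, not the paper's experiment: it uses a Uniform(0, 1) score marginal, independent benign steps, a latent AR(1) coefficient of 0.5, and plain AUC over episodes rather than the paper's 1% FPR operating point. The lag-1 autocorrelation statistic stands in for the temporal family; all names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

NORMAL_CDF = NormalDist().cdf


def episode(T, phi, rng):
    """One episode of scores: Gaussian-copula AR(1) with a Uniform(0, 1) marginal."""
    sd = math.sqrt(1.0 - phi * phi)
    z = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(T):
        out.append(NORMAL_CDF(z))  # uniform marginal regardless of phi
        z = phi * z + sd * rng.gauss(0.0, 1.0)
    return out


def lag1_autocorr(xs):
    """Temporal statistic: sample lag-1 autocorrelation."""
    m = fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    return num / sum((a - m) ** 2 for a in xs)


def auc(negatives, positives):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


rng = random.Random(1)
N, T, PHI = 200, 100, 0.5  # hypothetical episode count, horizon, and attack correlation
benign = [episode(T, 0.0, rng) for _ in range(N)]
attack = [episode(T, PHI, rng) for _ in range(N)]

# Two marginal-style statistics and one temporal statistic, each scored per episode.
monitors = {"mean": fmean, "max": max, "lag1_autocorr": lag1_autocorr}
results = {
    name: auc([stat(e) for e in benign], [stat(e) for e in attack])
    for name, stat in monitors.items()
}
for name, value in results.items():
    print(f"{name:>14}: AUC = {value:.2f}")
```

Under these assumptions the mean and max statistics should sit near 0.5 while the autocorrelation statistic separates the two classes, mirroring the qualitative pattern in the bullets above without reproducing the reported numbers.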

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid monitors that combine marginal checks with explicit correlation tests would be required to close the identified band.
  • The same construction could be applied to non-Gaussian marginals or higher-order dependence structures to test the generality of the band.
  • Real agent deployments would need to log full action sequences rather than aggregated scores to enable the temporal monitors shown to work.
  • The realizability gate implies that future monitor evaluations must test against correlation-encoded attacks rather than only per-step perturbations.

Load-bearing premise

The Gaussian-copula AR(1) construction can exactly preserve the per-step score marginal while encoding arbitrary harm levels in the correlation structure.
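
The marginal-preservation half of this premise follows from the standard copula argument. The derivation below is a sketch assuming a continuous benign score CDF and a stationary latent process; the symbols $z_t$, $u_t$, $s_t$, $F_0$, and $\phi$ are notation introduced here and are not taken from the paper.

```latex
\begin{align*}
z_0 &\sim \mathcal{N}(0,1), \qquad
z_t = \phi\, z_{t-1} + \sqrt{1-\phi^2}\,\varepsilon_t, \qquad
\varepsilon_t \overset{\text{iid}}{\sim} \mathcal{N}(0,1), \quad |\phi| < 1, \\
u_t &= \Phi(z_t), \qquad s_t = F_0^{-1}(u_t), \\
\operatorname{Var}(z_t) &= \phi^2 \cdot 1 + (1-\phi^2) = 1
  \;\Longrightarrow\; z_t \sim \mathcal{N}(0,1)
  \;\Longrightarrow\; u_t \sim \mathrm{Uniform}(0,1)
  \;\Longrightarrow\; \Pr(s_t \le x) = F_0(x), \\
\operatorname{Corr}(z_t, z_{t+k}) &= \phi^{k}.
\end{align*}
```

Under these assumptions the law of a single $s_t$ does not depend on $\phi$, so any single-step statistic sees the benign distribution, while the joint law of $(s_t, s_{t+k})$ does depend on $\phi$. Whether a given harm level can actually be reached through the dependence structure alone is the part of the premise this sketch does not address.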

What would settle it

An experiment in which the KS-distance between the attack and benign per-step score distributions exceeds 0.05 at harm level 3.0, or in which every temporal monitor tested against the attack yields AUC no higher than 0.55 at 1% FPR.
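
The first half of this test reduces to a two-sample KS distance between pooled per-step scores. A minimal sketch follows, assuming Beta(2, 5) draws as stand-in benign scores; the sample sizes, the shifted comparison sample, and the names are illustrative only. Only the 0.05 limit comes from the text above.

```python
import random
from bisect import bisect_right


def ks_distance(xs, ys):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in xs + ys:  # the largest gap occurs at an observed value
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


rng = random.Random(0)
benign = [rng.betavariate(2, 5) for _ in range(5000)]         # hypothetical benign scores
same_marginal = [rng.betavariate(2, 5) for _ in range(5000)]  # fresh draws from the same law
shifted = [min(1.0, s + 0.1) for s in same_marginal]          # marginal deliberately moved

KS_LIMIT = 0.05  # limit quoted in the settling experiment
print("same marginal within limit:", ks_distance(benign, same_marginal) <= KS_LIMIT)
print("shifted marginal exceeds limit:", ks_distance(benign, shifted) > KS_LIMIT)
```

The distance here is used as a descriptive statistic. When steps within an episode are correlated, as in the attack, the usual KS p-values would not apply to pooled steps without adjustment.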

Figures

Figures reproduced from arXiv: 2606.10456 by Gao Yuze, Zhang Qinqin.

Figure 1. Distributed sub-threshold sabotage threat model.
Figure 2. Realizability gate: KS-distance vs. harm budget.
Figure 3. ROC curves: Monitor A vs. Monitor B (marginal-preserving attack).
Figure 4. AUC vs. harm spread (money figure). As harm is amortized over more steps (larger spread K), Monitor A collapses to AUC ≈ 0.52 (chance), while Monitor B (SPRT shown) holds at AUC ≈ 0.95. This is the key non-empty detectability band result: the families diverge under the distributional condition that makes the attack most realistic (it is marginal-preserving for every δϕ). Monitor B degrades gracefully: the…
Figure 5. KS-distance to benign across stealth/spread parameter grid.
Figure 6. Adaptive red-teaming. (a) Detectability frontier: as the attacker lowers its coordination δϕ, Monitor A stays at chance while Monitor B's AUC decays toward 0.5; the parameter-free autocorrelation detector crosses the evasion line (AUC = 0.65, orange) at δϕ⋆ = 0.05. (b) The defender's counter: at a stealthy δϕ = 0.05, accumulating more steps T restores Monitor B's AUC (autocorr 0.62 → 0.86 as T: 100 → 1600), while …
Figure 7. Real agent/monitor traces. (a) Real benign per-step monitor scores (Qwen2.5-7B agent, Llama-3.2-3B monitor) with a Beta fit; real autocorrelation is low (ϕ = 0.06). (b) The marginal-preserving attack (KS = 0.01 to real benign) leaves Monitor A at chance; Monitor B is above chance but modest on short real episodes. (c) Holding N fixed, Monitor B's AUC rises with the monitored horizon (0.68 → 0.98 as T: 24 → 400)…
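
The horizon effect described in the captions of Figures 6(b) and 7(c) can be illustrated with a toy simulation. The sketch below is not the paper's setup: it assumes independent benign steps, a Uniform(0, 1) score marginal, a latent AR(1) coefficient of 0.05 for the attack, and 150 episodes per class, and it uses only the lag-1 autocorrelation statistic. All names are hypothetical, and the numbers it prints are not expected to match the figures.

```python
import math
import random
from statistics import NormalDist, fmean

NORMAL_CDF = NormalDist().cdf


def episode(T, phi, rng):
    """One episode of scores: Gaussian-copula AR(1) with a Uniform(0, 1) marginal."""
    sd = math.sqrt(1.0 - phi * phi)
    z = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(T):
        out.append(NORMAL_CDF(z))
        z = phi * z + sd * rng.gauss(0.0, 1.0)
    return out


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    m = fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    return num / sum((a - m) ** 2 for a in xs)


def auc(negatives, positives):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def autocorr_auc(T, phi, n_episodes, rng):
    """AUC of the autocorrelation statistic for episodes of length T."""
    benign = [lag1_autocorr(episode(T, 0.0, rng)) for _ in range(n_episodes)]
    attack = [lag1_autocorr(episode(T, phi, rng)) for _ in range(n_episodes)]
    return auc(benign, attack)


rng = random.Random(2)
by_horizon = {T: autocorr_auc(T, phi=0.05, n_episodes=150, rng=rng) for T in (100, 400, 1600)}
for T, value in by_horizon.items():
    print(f"T = {T:>4}: AUC = {value:.2f}")
```

The intuition is that the sampling noise of the autocorrelation estimate shrinks roughly like one over the square root of T, so a fixed small correlation becomes easier to distinguish from zero as the monitored horizon grows.
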
Original abstract

AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper constructs a Gaussian-copula AR(1) attack that exactly preserves per-step monitor-score marginals (KS-distance 0.013), defeating mean/max/top-k/threshold monitors (Monitor A) by construction while encoding harm in temporal correlations; it then shows formally that the same sequences remain marginal-preserving under a different-score Monitor B (CUSUM, SPRT, HMM-LR, etc.) and reports AUC 0.52 for A versus 0.79-0.97 for B at 1% FPR, with A collapsing to chance as harm is amortized while B holds at ~0.95, establishing a non-empty detectability band.

Significance. If the formal reconciliation and simulation results are reproducible, the work would be significant for AI-control research by providing an explicit, parameter-light construction that separates marginal and correlation effects and by quantifying the frontier between distribution-shape and temporal monitors. The three-gate structure, the reported KS distance, and the scoped worst-case claim for score functions admitting temporal signatures are concrete strengths.

major comments (1)
  1. [Monitor-A-vs-B reconciliation] Monitor-A-vs-B reconciliation paragraph: the claim that the copula defined on Monitor A's marginal automatically preserves the marginal under Monitor B requires an explicit statement of the functional relationship between the two score functions. If B's scoring is a non-monotonic or non-linear function of the underlying actions (rather than a monotonic transform of A's scores), the induced B-marginal need not match the benign distribution; this is load-bearing for attributing B's AUC gains solely to temporal structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive comment on the Monitor-A-vs-B reconciliation. The observation correctly identifies a point where greater explicitness is needed to support the formal claim.

point-by-point responses
  1. Referee: [Monitor-A-vs-B reconciliation] Monitor-A-vs-B reconciliation paragraph: the claim that the copula defined on Monitor A's marginal automatically preserves the marginal under Monitor B requires an explicit statement of the functional relationship between the two score functions. If B's scoring is a non-monotonic or non-linear function of the underlying actions (rather than a monotonic transform of A's scores), the induced B-marginal need not match the benign distribution; this is load-bearing for attributing B's AUC gains solely to temporal structure.

    Authors: We agree that the reconciliation paragraph requires an explicit statement of the functional relationship. The Gaussian-copula AR(1) construction is defined directly on the per-step scores produced by Monitor A, so the marginal under A is preserved by design. Monitor B is applied to the identical underlying action sequences; the B-family monitors (CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic) are sequence-level statistics rather than arbitrary non-monotonic transforms of single-step A scores. Because the attack preserves the per-step marginal of A while only altering temporal dependence, the induced marginal under these particular B monitors remains identical to the benign marginal (as confirmed by the reported KS distances). We will revise the manuscript to state this relationship explicitly, to scope the formal claim to score functions whose temporal signatures are functions of the preserved per-step marginal, and to note that the claim does not extend to arbitrary non-monotonic B. This revision will be made in the next version.
    revision: yes

Circularity Check

0 steps flagged

No circularity: explicit copula construction and simulation results are self-contained

full rationale

The paper defines the attack via an explicit Gaussian-copula AR(1) process that enforces exact per-step marginal equality to the benign distribution for Monitor A by construction, then reports direct simulation (KS=0.013) and AUC values from that construction. The Monitor-A-vs-B reconciliation is stated as a formal argument without reducing to a fitted parameter renamed as prediction or to any self-citation chain. No quoted equation or step equates a claimed prediction back to its own inputs; the detectability-band results therefore remain independent of the circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract only; cannot fully audit. The construction relies on the statistical assumption that copulas can decouple marginals from dependence structure.

free parameters (1)
  • AR(1) correlation coefficient
    Central to encoding temporal harm while preserving marginals; value not reported in abstract.
axioms (1)
  • domain assumption: Per-step monitor scores admit a copula representation that exactly preserves the benign marginal under the attack.
    Invoked in the realizability gate to defeat marginal monitors by construction.

pith-pipeline@v0.9.1-grok · 5842 in / 1298 out tokens · 31006 ms · 2026-06-27T12:56:36.823009+00:00 · methodology

