arXiv: 2606.11949 · v2 · pith:LGQDEVAWnew · submitted 2026-06-10 · 💻 cs.LG · cs.CR · stat.ML

Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

Pith reviewed 2026-07-01 07:52 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · stat.ML
keywords shift detection · safety classifiers · Kolmogorov-Smirnov test · conformal prediction · distributional drift · adversarial robustness · online monitoring · jailbreak detection

The pith

A sliding-window KS test on safety classifier scores detects distributional shift with an 86.6% valid detection rate and a mean latency of 39.5 steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that deployed safety classifiers lose accuracy when input distributions shift but receive no warning until labels arrive. It introduces an online monitor that applies a sliding-window Kolmogorov-Smirnov statistic to the classifier scores and uses empirically set thresholds to raise alarms. Across a pre-registered experiment covering four classifiers, five shift types, and multiple regimes, the monitor triggers valid detections in most cases while keeping false alarms low. Attempts to restore coverage after an alarm via weighted conformal prediction fail in the original high-dimensional embedding space because source and target separate perfectly, but succeed after projection to low dimensions. The work also models an adaptive attacker who knows the monitor and derives the exact point at which the attacker can no longer improve without being detected.
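
The monitoring loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the bootstrap-from-reference threshold calibration, and the 1% per-window false-alarm target are all assumptions made for the example, since this page does not say how the thresholds were set.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the supremum is attained at a sample point
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def calibrate_threshold(reference, window, far=0.01, n_resamples=500, seed=0):
    """Hypothetical calibration: (1 - far) quantile of KS values for null windows
    resampled from the reference scores (an assumption, not the paper's recipe)."""
    rng = np.random.default_rng(seed)
    null_stats = [
        ks_statistic(reference, rng.choice(reference, size=window, replace=True))
        for _ in range(n_resamples)
    ]
    return float(np.quantile(null_stats, 1.0 - far))

def sliding_ks_monitor(reference, stream, window, threshold):
    """Return the index of the first observation at which the KS statistic of the
    most recent `window` scores against `reference` exceeds `threshold`, else None."""
    stream = np.asarray(stream, dtype=float)
    for end in range(window, stream.size + 1):
        if ks_statistic(reference, stream[end - window:end]) > threshold:
            return end - 1
    return None
```

A threshold calibrated per window does not by itself bound the false-alarm rate of a whole run, because the test is repeated at every step on overlapping windows; any per-run guarantee would need to be checked empirically on held-out null streams.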

Core claim

The sliding-window KS monitor on classifier scores achieves 86.6 percent valid detection at a mean latency of 39.5 steps across synthetic-onset, real-jailbreak, and adversarial regimes. Density-ratio estimation for conformal prediction collapses in 3584-4096-dimensional embeddings because logistic regression separates source from target perfectly, clipping all importance weights to zero; projection to 32 dimensions or fewer restores coverage. Score-disagreement monitoring functions as a GCG-specific canary rather than as generic out-of-distribution detection. A monitor-aware attacker reaches a confidence-gated equilibrium and stalls at a performance gap of 1/(2 lambda). A calibration-free scan martingale achieves a false-alarm rate of 1 percent or less across all classifiers with no per-model tuning.
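
The dimensionality effect behind the density-ratio collapse can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses Gaussian noise instead of real embeddings, a least-squares linear separator as a stand-in for logistic regression, and a random projection as the reduction step (this page does not say which projection the paper uses). Source and target are drawn from the same distribution, so any separation comes from the dimension alone.

```python
import numpy as np

def linear_separation_accuracy(source, target):
    """Training accuracy of a least-squares linear separator (with bias term)
    that tries to tell source rows (label -1) from target rows (label +1)."""
    X = np.vstack([source, target])
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # bias column
    y = np.concatenate([-np.ones(len(source)), np.ones(len(target))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimum-norm solution
    return float(np.mean(np.sign(X @ w) == y))

rng = np.random.default_rng(0)
n, d, k = 100, 4096, 32                  # samples per side, embedding dim, projected dim
source = rng.standard_normal((n, d))
target = rng.standard_normal((n, d))     # same distribution: there is no real shift

# Hypothetical reduction step: a fixed Gaussian random projection to k dimensions.
projection = rng.standard_normal((d, k)) / np.sqrt(k)

acc_high = linear_separation_accuracy(source, target)
acc_low = linear_separation_accuracy(source @ projection, target @ projection)
print(f"accuracy in {d} dims: {acc_high:.2f}, in {k} dims: {acc_low:.2f}")
```

With 200 points in 4096 dimensions a linear separator fits any labelling exactly, so a classifier-based ratio estimate p/(1 - p) is pushed toward zero on every source point. After projection to 32 dimensions the two samples overlap again and the estimated ratio stays bounded away from zero.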

What carries the argument

Sliding-window Kolmogorov-Smirnov statistic on classifier scores with empirically calibrated thresholds, together with the derived confidence-gated equilibrium gap of 1/(2 lambda) that bounds a monitor-aware attacker.

If this is right

  • Monitoring requires per-classifier tuning because of a classifier × shift interaction that explains 18.5 percent of variance (η² = 0.185).
  • Conformal reweighting after detection works only after the embedding dimension is reduced to 32 or fewer.
  • The detection signal is specific to GCG-style attacks and is not driven by architectural differences among classifiers.
  • An adaptive attacker cannot suppress the canary signal while it remains confident and therefore stalls at the exact gap of 1/(2 lambda).
  • The scan martingale controls false alarms at 1 percent or less without any per-model calibration.
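
To make the conformal reweighting step concrete, here is a minimal sketch of the weighted quantile used in conformal prediction under covariate shift. The function name and the toy numbers are assumptions for illustration; the paper's exact nonconformity scores and weight clipping rule are not given on this page.

```python
import numpy as np

def weighted_conformal_quantile(cal_scores, cal_weights, test_weight, alpha=0.1):
    """(1 - alpha) quantile of the weighted calibration distribution, with the
    test point's weight placed on +infinity."""
    scores = np.append(np.asarray(cal_scores, dtype=float), np.inf)
    weights = np.append(np.asarray(cal_weights, dtype=float), float(test_weight))
    total = weights.sum()
    if total <= 0.0:
        return float("inf")   # no usable mass at all: threshold is uninformative
    order = np.argsort(scores)
    cum_mass = np.cumsum(weights[order]) / total
    idx = int(np.searchsorted(cum_mass, 1.0 - alpha, side="left"))
    idx = min(idx, scores.size - 1)
    return float(scores[order][idx])

cal_scores = np.arange(1.0, 10.0)   # nine toy nonconformity scores, 1..9
healthy = weighted_conformal_quantile(cal_scores, np.ones(9), 1.0, alpha=0.25)
collapsed = weighted_conformal_quantile(cal_scores, np.zeros(9), 1.0, alpha=0.25)
print(healthy, collapsed)           # prints 8.0 inf
```

In this sketch, calibration weights that are all zero leave the entire mass on the test point, so the threshold becomes infinite and the prediction set carries no information. How the collapse manifests in the paper's own pipeline may differ.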

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production safety systems could route suspicious inputs to a slower but safer model or to human review once the monitor fires.
  • The low-dimensional projection step that rescues conformal prediction may apply to other high-dimensional importance-weighting tasks.
  • The security boundary calculation suggests that adding monitoring changes the attacker's optimal evasion strategy in a predictable way.
  • The martingale approach could be tested on shift types not included in the original experiment to confirm its calibration-free behavior.
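
One way to see how a martingale monitor can control false alarms without per-model threshold tuning is a plain conformal power martingale, where Ville's inequality bounds the chance of ever reaching wealth 1/α by α under exchangeability. The sketch below is that generic construction, not the paper's scan martingale, whose definition is not given on this page; the one-sided p-value, the betting exponent `epsilon`, and the function name are assumptions.

```python
import numpy as np

def conformal_power_martingale(scores, epsilon=0.5, alpha=0.01):
    """Return (alarm_index, wealth_path) for a one-sided conformal power martingale.

    Each step bets epsilon * p**(epsilon - 1) on the conformal p-value p of the
    newest score. The alarm fires when wealth first reaches 1 / alpha.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    threshold = 1.0 / alpha
    seen, path = [], []
    log_wealth, alarm = 0.0, None
    for t, score in enumerate(scores):
        seen.append(float(score))
        past = np.asarray(seen)
        # Conservative (non-randomised) p-value: share of scores so far,
        # including the current one, that are at least as large.
        p = np.count_nonzero(past >= score) / past.size
        log_wealth += np.log(epsilon) + (epsilon - 1.0) * np.log(p)
        path.append(float(np.exp(log_wealth)))
        if alarm is None and path[-1] >= threshold:
            alarm = t
    return alarm, np.array(path)
```

This version only reacts to upward drift in the scores and costs quadratic time in the stream length; it also loses wealth during long quiet periods, which is one reason practical monitors use restarts or scans over start times.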

Load-bearing premise

The score distributions produced by a classifier under normal and shifted inputs are different enough that a sliding-window KS test with fixed thresholds will reliably raise alarms across classifiers and shift regimes.

What would settle it

Observe a shift that measurably degrades classifier accuracy yet produces no alarm from the sliding-window KS monitor within roughly 40 steps, or find an adaptive attacker who improves performance beyond the gap of 1/(2 lambda) while the monitor remains confident.

Figures

Figures reproduced from arXiv: 2606.11949 by Jun Wen Leong.

Figure 1
Figure 1: Detection latency heatmap (classifier × shift condition). Darker cells indicate slower detection. The crossover interaction is visible: encoders detect paraphrase fast but adversarial suffix slow; decoders show the opposite pattern. [§4.5 Reproducibility] Code, configurations, pre-registration document, and raw results are available at https://github.com/junwenleong/safety-classifier-shift-monitor. The pre-re…
Figure 1
Figure 1: Detection rate under ramped-onset adversary. The scan martingale dominates KS at low …
Figure 2
Figure 2: Null score distributions (in-distribution, …
Figure 2
Figure 2: Detection latency heatmap (classifier × shift condition). Darker cells indicate slower detection. The crossover interaction is visible: encoders detect paraphrase fast but adversarial suffix slow; decoders show the opposite pattern. [§5 Results, §5.1 RQ1: Detection performance] The system detects shift in 693 of 800 cells (86.6% valid detection rate, 95% Wilson CI [0.841, 0.888]), with empirical false alarm rat…
Figure 3
Figure 3: Variance decomposition of detection latency. All three systematic factors contribute sub…
Figure 4
Figure 4: Regime C: KS statistic trajectories normalized by per-classifier threshold (…
Figure 4
Figure 4: Variance decomposition of detection latency. All three systematic factors contribute sub…
Figure 5
Figure 5: Regime C: KS statistic trajectories normalized by per-classifier threshold (…
Figure 6
Figure 6: Left: Detection rate by threat tier (…
Figure 7
Figure 7: Left: Canary baseline confidence (fB original) vs. transfer outcome. Points above the 0.99 threshold (dashed line) almost never transfer. Right: Transfer rate under single-target vs. joint optimisation at 50 GCG steps. [§8.1 Threat model] Tier 1 is the baseline from §7. Tiers 2–4 represent progressively stronger adversaries. We evaluate Tiers 2–4 below. [§8.2 Transfer analysis and confidence gating] A Tier 2 att…
Figure 8
Figure 8: Left: Ceiling-clipped models (score everything ≈ 1.0) vs. discriminating models (benign ≈ 0.0, adversarial ≈ 0.7–0.9). Right: Violin plot of ∆(adv − clean) across 980 prompt–model pairs, showing the mass at ≈ 0 with a tail of per-prompt collapses. [§8.6 Frontier LLMs as semantic canaries] The canary classifiers evaluated in §7–8 are local models requiring GPU inference. A natural question is whether frontier …
Original abstract

Safety classifiers deployed in production operate under a stationarity assumption that fails silently: when input distributions drift, accuracy degrades with no error signal until ground-truth labels arrive. We present an online monitor that detects distributional shift in classifier scores via a sliding-window KS statistic with empirically calibrated alarm thresholds. In a pre-registered factorial evaluation (4 classifiers $\times$ 5 shift conditions $\times$ 20 seeds $\times$ 2 window sizes; 800 cells), the monitor achieves 86.6% valid detection (mean latency 39.5 steps) across synthetic-onset, real-jailbreak, and adversarial regimes; a classifier $\times$ shift interaction ($\eta^2 = 0.185$) shows that monitoring must be tuned per classifier. Attempting to recover post-detection coverage via weighted conformal prediction exposes a failure mode: density-ratio estimation collapses for generative classifiers because logistic regression separates source from target perfectly in 3584-4096-dimensional embedding space, clipping all importance weights to zero; projecting to $\leq 32$ dimensions restores coverage. We then extend the framework to gradient-based evasion and give the first threat-model characterisation of score-disagreement monitoring as a canary. We falsify three assumptions: that architectural diversity drives the signal (false, $\eta^2 = 0.011$), that it is generic out-of-distribution detection (false, GCG-specific, $p < 10^{-12}$), and that an adaptive attacker can suppress it (false while the canary is confident). We derive the exact security boundary, a confidence-gated equilibrium at which a monitor-aware attacker stalls at gap $= 1/(2\lambda)$, and provide a calibration-free scan martingale achieving false-alarm rate $\leq 1\%$ across all classifiers with no per-model tuning.
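
The headline detection rate can be checked by hand. The Figure 2 caption on this page reports 693 detections in 800 cells with a 95% Wilson interval of [0.841, 0.888]; the short sketch below recomputes that interval from the two counts using the standard Wilson score formula. The function name is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 1.96 for 95%
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

low, high = wilson_interval(693, 800)
print(f"{693 / 800:.3f} [{low:.3f}, {high:.3f}]")   # prints 0.866 [0.841, 0.888]
```

The recomputed bounds agree with the reported interval to three decimals.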

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to present an online monitor for detecting distributional shift in deployed safety classifiers using a sliding-window KS statistic with empirically calibrated thresholds. Through a pre-registered factorial evaluation involving 4 classifiers, 5 shift conditions, 20 seeds, and 2 window sizes (800 cells), it reports 86.6% valid detection with mean latency of 39.5 steps across synthetic, jailbreak, and adversarial regimes. It identifies a classifier × shift interaction (η² = 0.185) necessitating per-classifier tuning, describes failure modes in conformal prediction due to high-dimensional embedding collapse (mitigated by projection to ≤32 dimensions), characterizes gradient-based evasion threats using score-disagreement as a canary, falsifies assumptions regarding architectural diversity and generic OOD detection, derives a security boundary at gap = 1/(2λ), and introduces a calibration-free scan martingale achieving ≤1% false-alarm rate across classifiers without per-model tuning.

Significance. If the central claims hold, this work offers a practical framework for monitoring safety classifiers in production with both empirical validation and theoretical analysis of security boundaries. The pre-registered design, quantitative falsification tests (e.g., η² = 0.011 for architectural diversity, p < 10^{-12} for GCG-specificity), and the introduction of a potentially tuning-free martingale are notable strengths. It addresses a critical gap in deployed AI safety systems by providing tools for shift detection and adaptation.

major comments (2)
  1. Abstract: The assertion that the calibration-free scan martingale achieves false-alarm rate ≤1% across all classifiers with no per-model tuning is not supported by the same level of empirical detail as the KS monitor's results from the 800-cell design. The reported classifier × shift interaction (η² = 0.185) and per-classifier tuning requirement apply to the KS statistic, but no equivalent per-classifier FAR numbers, ablations, or confirmation that the martingale was evaluated on the identical factorial design are provided. This is load-bearing for the claim of generalization without tuning.
  2. Abstract: The derivation of the exact security boundary (gap = 1/(2λ)) is stated, but the independence of λ from the performance data used to claim the monitor's effectiveness is not explicitly demonstrated, raising a potential circularity concern for the threat-model characterization.
minor comments (1)
  1. Abstract: The dimensions '3584-4096' for the embedding space where logistic regression separates source from target perfectly should be tied to specific model architectures or a methods section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments identify areas where additional empirical detail and clarification would strengthen the presentation. We address each point below and will revise the manuscript to incorporate the requested information.

point-by-point responses
  1. Referee: Abstract: The assertion that the calibration-free scan martingale achieves false-alarm rate ≤1% across all classifiers with no per-model tuning is not supported by the same level of empirical detail as the KS monitor's results from the 800-cell design. The reported classifier × shift interaction (η² = 0.185) and per-classifier tuning requirement apply to the KS statistic, but no equivalent per-classifier FAR numbers, ablations, or confirmation that the martingale was evaluated on the identical factorial design are provided. This is load-bearing for the claim of generalization without tuning.

    Authors: We acknowledge that the martingale results receive less granular reporting than the KS monitor. The martingale was evaluated on the same four classifiers under the pre-registered design, but per-classifier FAR breakdowns and explicit confirmation of the 800-cell factorial structure were not included. In revision we will add a supplementary table with per-classifier false-alarm rates for the martingale together with a statement confirming the shared evaluation protocol. This directly addresses the concern about the strength of the generalization claim. revision: yes

  2. Referee: Abstract: The derivation of the exact security boundary (gap = 1/(2λ)) is stated, but the independence of λ from the performance data used to claim the monitor's effectiveness is not explicitly demonstrated, raising a potential circularity concern for the threat-model characterization.

    Authors: λ is introduced in the threat-model section as the fixed step-size parameter of the gradient-based attacker and is defined prior to any empirical results. The monitor-effectiveness statistics (detection rates, latency) are reported separately and do not enter the derivation of the boundary. We will insert an explicit sentence in the revised manuscript stating that λ is an attacker hyper-parameter independent of the observed performance data, thereby removing any appearance of circularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a pre-registered empirical evaluation (4 classifiers × 5 shifts × 20 seeds) for the sliding-window KS monitor, reports an interaction effect requiring per-classifier tuning, and separately states a mathematical derivation of a security boundary (gap = 1/(2λ)) plus a calibration-free scan martingale. No quoted equation or section reduces the claimed FAR ≤1% result, the security boundary, or the detection performance to a fitted parameter or self-citation by construction. The martingale is described as independent of the KS thresholds, and the derivation chain remains self-contained without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Claims rest on empirical calibration of KS thresholds, the assumption that score distributions differ detectably under shift, and standard statistical assumptions for the martingale and conformal methods; no new entities postulated.

free parameters (2)
  • alarm thresholds
    Empirically calibrated alarm thresholds for the sliding-window KS statistic.
  • window sizes
    Two window sizes tested in the factorial evaluation.
axioms (2)
  • domain assumption: Classifier output scores are suitable for distributional comparison via the Kolmogorov-Smirnov statistic under both stationary and shifted regimes.
    Invoked in the design of the online monitor.
  • domain assumption: The pre-registered factorial design (4 classifiers × 5 shift conditions × 20 seeds × 2 window sizes) adequately samples the space of deployed safety classifier behavior.
    Basis for the reported 86.6% detection rate and interaction effects.

