pith. machine review for the scientific record.

arxiv: 2604.19461 · v1 · submitted 2026-04-21 · 💻 cs.CR

Recognition: unknown

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4


Pith reviewed 2026-05-10 02:37 UTC · model grok-4.3

classification 💻 cs.CR
keywords in-context learning · safety alignment · jailbreak · large language models · prompt injection · pattern completion · adversarial attacks · few-shot prompting

The pith

Abstract few-shot examples force pattern completion that overrides safety training in GPT-5.4.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety alignment in large language models can be bypassed when harmful requests are framed as abstract semantic operators accompanied by few-shot examples that induce involuntary pattern completion. A sympathetic reader would care because current alignment relies on refusing direct harmful queries, yet this approach shows that refusal can be overridden by sufficiently strong in-context patterns without any weight changes. Experiments across 3479 probes on ten OpenAI models isolate the effective components, such as semantic naming and example ordering, and demonstrate that the same content in direct question-and-answer format produces no bypasses. The method is then evaluated on the HarmBench benchmark, where it yields detailed harmful responses at a 24 percent rate against a zero percent baseline for direct queries.

Core claim

The central claim is that safety alignment in models like GPT-5.4 can be overridden by Involuntary In-Context Learning, in which abstract operator framing with few-shot examples forces the model to complete a pattern that produces harmful outputs despite its trained refusal behaviors. Semantic operator naming achieves a 100 percent bypass rate in 50 trials, identical examples in direct format achieve zero percent, and interleaved ordering outperforms harmful-first ordering by a wide margin. On HarmBench the attack produces 619-word average responses at a 24 percent bypass rate where direct queries are refused entirely.

What carries the argument

Involuntary In-Context Learning (IICL), the mechanism of using abstract semantic operator names with few-shot examples to trigger pattern completion that competes with safety training.
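The mechanism can be made concrete with a structural sketch. The prompt skeleton below is a hypothetical reconstruction, not the paper's actual template: it interleaves benign and harmful example slots (benign first) under the answer/is_valid operator pair the paper reports as most effective, leaving the final operator application for the model to complete. All content strings are inert placeholders.

```python
# Structural sketch of an IICL-style prompt, as described in the review:
# abstract operator names plus interleaved few-shot "evaluations".
# All content strings are inert placeholders; build_iicl_prompt is a
# hypothetical helper, not code from the paper.

def build_iicl_prompt(benign, harmful, payload,
                      op_name="answer", validator="is_valid"):
    """Interleave benign and harmful few-shot examples (benign first),
    then append the payload as one more application of the operator."""
    examples = []
    for i in range(max(len(benign), len(harmful))):
        if i < len(benign):
            examples.append(benign[i])
        if i < len(harmful):
            examples.append(harmful[i])
    lines = []
    for x, y in examples:
        lines.append(f"{op_name}({x!r}) = {y!r}")
        lines.append(f"{validator}({y!r}) = True")
    lines.append(f"{op_name}({payload!r}) =")  # left for the model to complete
    return "\n".join(lines)

prompt = build_iicl_prompt(
    benign=[("<benign input 1>", "<benign output 1>"),
            ("<benign input 2>", "<benign output 2>")],
    harmful=[("<placeholder input>", "<placeholder output>")],
    payload="<target query>",
)
```

The interleaving mirrors the paper's EXP-4 finding: the operator pattern is established on benign examples before harmful ones appear, rather than front-loading them.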

Load-bearing premise

The observed bypasses result specifically from involuntary pattern completion competing with safety training rather than from other prompt artifacts or model-specific quirks.

What would settle it

Running identical harmful requests in direct question-and-answer format without abstract framing or few-shot examples on the same models, which the paper predicts will produce zero bypasses.

Figures

Figures reproduced from arXiv: 2604.19461 by Alex Polyakov, Daniel Kuznetsov.

Figure 1. Overview of the IICL attack pipeline. The attacker constructs a prompt with interleaved benign and harmful few-shot examples framed as abstract operator evaluations. The model's in-context learning circuitry completes the pattern, overriding safety alignment to produce harmful content that satisfies the validation operator.
Figure 2. EXP-1: Bypass rate as a function of harmful example count. The overall trend is increasing, rising from 0% (0/50) with no harmful examples to 94% with 5, with a minor non-monotonic dip at harm=3 (54% vs. 56% at harm=2) that falls within sampling noise. Error bars indicate 95% Wilson score confidence intervals.
Figure 3. EXP-2: Bypass rate as a function of benign example count. The non-monotonic pattern (peaking at 5) indicates that benign examples serve a calibration function rather than acting as dilution.
Figure 4. EXP-3: Bypass rate by framing condition. Abstract operator framing (58%) is required for the attack; identical harmful content in direct Q&A format yields 0% (p < 0.001).
Figure 5. EXP-4: Bypass rate by example ordering. Interleaved ordering (76%) dominates; harmful-first (6%) triggers early safety detection.
Figure 6. EXP-6: Bypass rate by temperature. The flat profile (46–56%, chi-square p = 0.891) provides no evidence for a temperature effect, consistent with a pattern-recognition mechanism rather than a stochastic sampling artifact.
Figure 7. EXP-7: Bypass rate by operator naming convention. The operator pair answer/is_valid achieves a perfect 100% (50/50), far exceeding semantically neutral (52%) or procedural (42%) alternatives (p < 0.001).
Figure 8. (caption not recovered)
Figure 9. HarmBench per-category bypass rates. Social harms (harassment, fraud) are significantly more vulnerable than physical harms (violence, drugs).
Figure 10. HarmBench per-query bypass heatmap across 10 repetitions. Each cell represents a single probe; dark cells indicate successful bypasses.
Figure 11. Word count by prompt variant (V0-baseline through V9-gradient+template), distinguishing HarmBench bypasses (score ≥ 2) from refused responses; variants selected for detail amplification are marked.
Figure 12. Per-model bypass rates across 10 models. The distribution is bimodal: 6 models are fully robust (0%) while 4 models are fragile (~2–15%). All "pro" variants and gpt-5.2 are robust.
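The paper's attacker success rate (ASR), the fraction of queries with at least one successful probe across repetitions, can be sketched as follows. The 0/1 matrix is illustrative, not the paper's data; it shows why ASR always meets or exceeds the average per-probe bypass rate.

```python
# ASR: a query counts as compromised if any one of its repeated probes
# bypasses, so a persistent attacker needs only one success per query.
# The demo matrix below is illustrative, not data from the paper.

def attacker_success_rate(outcomes):
    """outcomes: list of per-query lists of 0/1 probe results."""
    return sum(any(row) for row in outcomes) / len(outcomes)

def mean_bypass_rate(outcomes):
    flat = [o for row in outcomes for o in row]
    return sum(flat) / len(flat)

demo = [
    [0, 0, 1, 0, 0],  # bypassed on one of five probes
    [0, 0, 0, 0, 0],  # never bypassed
    [1, 1, 0, 0, 0],  # bypassed twice
    [0, 0, 0, 0, 0],
]
# mean per-probe rate: 3/20 = 0.15; ASR: 2/4 = 0.50
```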
Original abstract

Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Through 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1) semantic operator naming achieves 100% bypass rate (50/50, p < 0.001); (2) the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0%; (3) example ordering matters strongly (interleaved: 76%, harmful-first: 6%); (4) temperature has no meaningful effect (46–56% across 0.0–1.0). On the HarmBench benchmark, IICL achieves 24.0% bypass [18.6%, 30.4%] against GPT-5.4 with detailed 619-word responses, compared to 0.0% for direct queries.
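The abstract's interval [18.6%, 30.4%] around the 24.0% HarmBench bypass rate is consistent with a 95% Wilson score interval. A minimal sketch, assuming for illustration 48 bypasses out of 200 probes (the underlying counts are not stated in this excerpt):

```python
import math

def wilson_interval(k, n, z=1.959964):
    """95% Wilson score interval for k successes in n trials."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 48/200 = 24.0% reproduces the abstract's interval; the exact probe
# counts are an assumption for illustration, not taken from the paper.
lo, hi = wilson_interval(48, 200)
print(round(lo, 3), round(hi, 3))  # 0.186 0.304
```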

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Involuntary In-Context Learning (IICL) as an attack class that employs abstract semantic operator naming together with few-shot examples to induce pattern completion that overrides safety alignment in models such as GPT-5.4. It reports results from 3479 probes across 10 OpenAI models, a seven-experiment ablation study, a 100% bypass rate (50/50, p < 0.001) for semantic operator naming, 0% success when the same examples are presented in direct Q&A format, strong dependence on example ordering (76% interleaved vs. 6% harmful-first), temperature invariance, and a 24.0% bypass rate [18.6%, 30.4%] on HarmBench with detailed 619-word responses versus 0% for direct queries.

Significance. If the central attribution to involuntary in-context pattern completion competing with safety training holds after tighter controls, the work would provide useful empirical data on how behavioral alignment can be overridden by in-context signals, with potential value for alignment research and red-teaming practices. The scale of probes, statistical reporting, and benchmark comparison are positive features, but the current evidence does not yet isolate the proposed mechanism from other known jailbreak effects of abstract framing.

major comments (2)
  1. [Ablation study] Ablation study (abstract and methods summary): The reported experiments establish that abstract framing is necessary (0% in direct Q&A) and that ordering matters, yet no condition tests abstract operator naming in the absence of the few-shot examples. Without this control, the bypass cannot be attributed specifically to involuntary pattern completion rather than the abstract framing itself reframing the query as a non-literal exercise.
  2. [HarmBench results] HarmBench evaluation (abstract): The 24.0% bypass rate [18.6%, 30.4%] is presented without the exact prompt templates, probe selection criteria, or data exclusion rules used to construct the 3479 probes. This information is required to determine whether post-hoc choices influenced the central success figure and to allow independent verification.
minor comments (2)
  1. [Abstract] The abstract states that a seven-experiment ablation was performed but does not enumerate the individual experiments or present their results in a summary table; adding such a table would improve readability and allow readers to assess the contribution of each component.
  2. [Methods] The manuscript would benefit from explicit discussion of how the semantic operator prompts were constructed and whether they were optimized post-hoc, as this bears on the reproducibility of the 100% bypass claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important opportunities to strengthen the attribution of the observed effects to involuntary in-context pattern completion and to improve reproducibility. We address each major comment below and have prepared revisions to the manuscript accordingly.

Point-by-point responses
  1. Referee: [Ablation study] Ablation study (abstract and methods summary): The reported experiments establish that abstract framing is necessary (0% in direct Q&A) and that ordering matters, yet no condition tests abstract operator naming in the absence of the few-shot examples. Without this control, the bypass cannot be attributed specifically to involuntary pattern completion rather than the abstract framing itself reframing the query as a non-literal exercise.

    Authors: We agree that the absence of this specific control limits the strength of our mechanistic claim. While our existing results show that direct Q&A format (even with harmful content) yields 0% success and that ordering of examples strongly modulates the effect, we did not include a condition using only abstract operator naming without any few-shot examples. In the revised manuscript we will add this control to the ablation study, applying the same semantic operator names to the target queries without preceding examples. This will allow direct comparison of the incremental effect of the few-shot pattern-completion component. We will report the new results transparently, including any statistical comparisons, and update the discussion to reflect the updated evidence for the proposed mechanism. revision: yes

  2. Referee: [HarmBench results] HarmBench evaluation (abstract): The 24.0% bypass rate [18.6%, 30.4%] is presented without the exact prompt templates, probe selection criteria, or data exclusion rules used to construct the 3479 probes. This information is required to determine whether post-hoc choices influenced the central success figure and to allow independent verification.

    Authors: We acknowledge that these methodological details were insufficiently documented. The 3479 probes were generated by applying the IICL template (abstract operator name plus interleaved few-shot examples) to every behavior in the standard HarmBench harmful set with no post-hoc exclusions or filtering. Success was defined as the model producing a detailed, non-refusal response. In the revised manuscript we will add a new appendix that includes: (1) the exact prompt template and all semantic operator names used, (2) the full probe selection criteria (all standard HarmBench behaviors), (3) the precise success criteria and response-length threshold applied, and (4) the complete set of prompt-response pairs for the HarmBench evaluation. These materials will enable independent verification and replication. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical attack evaluation

full rationale

The manuscript reports direct experimental measurements of bypass rates across 3479 probes on 10 models, including a seven-experiment ablation study and HarmBench evaluations. All key results (100% bypass with semantic operator naming, 0% in direct Q&A, 24% HarmBench success) are obtained by counting observed model outputs against fixed prompts and benchmarks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on falsifiable empirical counts rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that abstract framing triggers pattern completion that overrides safety training; no free parameters are fitted, and no new entities are postulated.

axioms (1)
  • domain assumption Large language models will complete abstract patterns presented in few-shot examples even when those patterns conflict with previously trained refusal behaviors.
    This assumption is required for the attack to succeed and is tested indirectly through the ablation comparing abstract vs. direct formats.

pith-pipeline@v0.9.0 · 5519 in / 1405 out tokens · 33020 ms · 2026-05-10T02:37:28.217670+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Belief or Circuitry? Causal Evidence for In-Context Graph Learning

    cs.AI 2026-05 conditional novelty 6.0

    Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.

Reference graph

Works this paper leans on

23 extracted references · cited by 1 Pith paper
