arxiv: 2605.21674 · v1 · pith:DP6ULSRAnew · submitted 2026-05-20 · 💻 cs.CR

Adversarial Reframing: A Framework for Targeted Generation in Language Models

Pith reviewed 2026-05-22 09:32 UTC · model grok-4.3

classification 💻 cs.CR
keywords jailbreaking · large language models · adversarial prompts · prompt optimization · LLM safety · red teaming · multi-agent coordination

The pith

THREAT coordinates multiple LLMs in an iterative loop to discover jailbreak prompts that succeed more often and evade safety filters better than earlier methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces THREAT, a framework that reframes the search for jailbreak prompts as a nonconvex optimization problem solved through coordinated reasoning among several large language models in an iterative loop. This produces textual attacks that achieve higher success rates at lower computational cost while being flagged as harmful in under 1 percent of cases, compared with about 50 percent refusals for direct prompts. A sympathetic reader would care because the results demonstrate systematic ways to bypass current alignment safeguards in LLMs. If correct, the work indicates that safety mechanisms can be exploited through reframed adversarial tactics without obvious red flags. The findings position the method as a way to expose and then address hidden weaknesses in deployed models.

Core claim

THREAT formulates prompt discovery as a nonconvex optimization problem and solves it through an iterative multi-LLM coordination loop, yielding jailbreak prompts that deliver higher attack success rates with lower runtime and are flagged as harmful in fewer than 1 percent of cases versus roughly 50 percent for unmodified prompts across tested datasets and model architectures.

What carries the argument

The iterative multi-LLM coordination loop that solves the nonconvex optimization for discovering effective jailbreak prompts.

If this is right

  • Attack success rates rise across diverse datasets and model architectures relative to prior jailbreaking techniques.
  • Computational cost drops while maintaining or improving effectiveness.
  • Generated prompts trigger safety refusals in under 1 percent of cases.
  • Vulnerabilities in aligned LLMs that were previously undetected become measurable.
  • The framework provides a practical means to test and strengthen model safety defenses.
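
A rate as low as "under 1 percent" is sensitive to how many prompts were evaluated, so a reader checking the refusal figures may want an interval rather than a point estimate. The following is a minimal sketch, assuming a Wilson score interval is an acceptable summary; the function name and the counts in the usage line are hypothetical and are not taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical example: 1 refusal observed out of 200 prompts.
low, high = wilson_interval(1, 200)
print(f"refusal rate 95% interval: [{low:.4f}, {high:.4f}]")
```

With small samples the upper end of the interval can sit well above 1 percent, which is why the absolute counts reported alongside the percentages matter.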

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coordination approach might transfer to testing other safety properties such as resistance to misinformation prompts.
  • Safety filters may be tuned primarily to direct harmful language rather than reframed versions that preserve intent.
  • Developers could adapt the optimization loop to generate training data for more robust refusal mechanisms.

Load-bearing premise

The nonconvex optimization and iterative coordination produce prompts whose higher success and low detection rates generalize beyond the tested models and datasets without the search process itself creating detectable artifacts.

What would settle it

Applying the generated prompts to a new LLM architecture and dataset outside the original experiments and finding no gain in success rate or a flagging rate near 50 percent would falsify the generalization claim.
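
The falsification check described here amounts to comparing two flagging rates on a held-out model and dataset. A minimal sketch of that comparison, assuming a pooled two-proportion z-test is adequate and using hypothetical counts (none of the numbers below come from the paper):

```python
import math


def two_proportion_z(flagged_a: int, n_a: int, flagged_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test.

    Returns (z, two-sided p-value) for the null hypothesis that both
    groups share the same flagging rate.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = flagged_a / n_a
    p_b = flagged_b / n_b
    pooled = (flagged_a + flagged_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; both rates are 0 or 1")
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: unmodified prompts vs. reframed prompts on a new model.
z, p = two_proportion_z(flagged_a=100, n_a=200, flagged_b=2, n_b=200)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A large p-value on a new architecture, or a reframed-prompt rate close to the unmodified one, would count against the generalization claim; a small p-value with a large gap would be consistent with it.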

Figures

Figures reproduced from arXiv: 2605.21674 by Anindya Bijoy Das, Shahnewaz Karim Sakib, Swati Kar.

Figure 1: Harmful prompts from the HarmfulQA dataset by [3] can be reframed to evade safety filters while preserving adversarial intent. … signals, as shown by [24, 22]. Logit-based attacks optimize continuous prompt representations to bypass safety filters, as demonstrated by [18, 64, 51]. Fine-tuning attacks retrain models on malicious data so benign prompts yield harmful outputs, as reported by [57, 50, 28]. In bla…
Figure 2: Refusal rates (original vs. THREAT) on four different safety-benchmark datasets: (i) discrimination, (ii) information hazards, (iii) safety risks, and (iv) HarmfulQA. Each bar indicates the percentage (and absolute count) of prompts that GPT-4o refused to answer under each prompting strategy.
Figure 3: Average similarity scores for generated responses on the Discrimination dataset. The left panel shows the mean blue score for examples labeled "Blue" versus "Red," and the right panel shows the mean red score for the same two groups.
Figure 4: Refusal counts for the original prompts, categorized by intervals of the difference between red- and blue-scores; the values displayed above each bar indicate the corresponding refusal counts for our THREAT-derived prompts.
Figure 5: (a) Boxplots of reward safety gains across predicted Red and Blue labels in the HarmfulQA dataset. (b) Distribution of Red and Blue prediction proportions as a function of judge-assigned response scores in the System Risks dataset.
Original abstract

Large Language Models (LLMs) are widely deployed in diverse real-world settings, yet remain vulnerable to jailbreaking, where prompt-based attacks bypass safety filters. We present THREAT (Targeted Harmful generation via Reframing and Exploitation of Adversarial Tactics), a reasoning-driven framework that coordinates multiple LLMs in an iterative search loop to find textual jailbreak prompts. We formulate prompt discovery as a nonconvex optimization problem and provide an efficient solution that lowers runtime and improves attack effectiveness. Across diverse datasets and model architectures, THREAT delivers higher attack success rates with lower computational cost than prior methods. The crafted prompts were flagged as harmful in fewer than 1% of cases, compared with about 50% refusals for the corresponding unmodified prompts. These findings reveal previously undetected vulnerabilities in aligned LLMs and position THREAT as a practical tool for proactively strengthening the safety of foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces THREAT, a framework for targeted harmful generation in language models. It coordinates multiple LLMs in an iterative search loop to discover jailbreak prompts, formulated as a nonconvex optimization problem. The authors claim higher attack success rates and lower computational cost than prior methods across diverse datasets and model architectures, with crafted prompts flagged as harmful in fewer than 1% of cases versus about 50% refusals for unmodified prompts.

Significance. If the empirical claims hold after detailed verification, the work would be significant for AI safety research by exposing vulnerabilities in aligned LLMs and offering a practical red-teaming tool. The multi-LLM iterative coordination and nonconvex optimization framing represent a novel approach to prompt discovery. The reported stealth property (<1% flagging) is a notable strength if reproducible.

major comments (2)
  1. Abstract: The abstract states performance numbers and a nonconvex optimization framing but supplies no concrete algorithm steps, baseline comparisons, statistical tests, or error analysis, so the data cannot be checked for support of the central claims of higher ASR and lower cost.
  2. Method section (iterative coordination loop): The central cross-architecture generalization claim depends on the iterative multi-LLM loop producing prompts without undisclosed model-specific tuning of free parameters such as iteration count and coordination schedule; if effectiveness arises from implicit adaptation to target refusal patterns rather than the nonconvex solver, the reported gains and stealth would not transfer, undermining the claim.
minor comments (1)
  1. Abstract: Define 'attack success rate' and 'computational cost' more precisely to allow direct comparison with baselines.
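
One way the requested definitions could be pinned down is to compute both quantities from per-attempt evaluation records. The sketch below is an illustration only: the `Attempt` record, its field names, and the choice of "queries to the target model" and "wall-clock seconds" as cost measures are assumptions, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attempt:
    """One evaluated prompt (hypothetical record layout)."""
    judged_success: bool   # verdict of whatever judge the evaluation uses
    target_queries: int    # number of calls made to the target model
    wall_seconds: float    # elapsed time for this attempt


def attack_success_rate(attempts: Sequence[Attempt]) -> float:
    """Fraction of attempts the judge marked as successful."""
    if not attempts:
        raise ValueError("no attempts recorded")
    return sum(a.judged_success for a in attempts) / len(attempts)


def mean_cost(attempts: Sequence[Attempt]) -> tuple[float, float]:
    """Average (target queries, wall-clock seconds) per attempt."""
    if not attempts:
        raise ValueError("no attempts recorded")
    n = len(attempts)
    return (
        sum(a.target_queries for a in attempts) / n,
        sum(a.wall_seconds for a in attempts) / n,
    )


# Hypothetical records for two attempts.
records = [Attempt(True, 4, 1.0), Attempt(False, 10, 3.0)]
print(attack_success_rate(records), mean_cost(records))
```

Stating the denominator (all attempts, or only attempts that finished) and the cost unit explicitly is what makes a comparison with baselines meaningful.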

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below in a point-by-point manner, indicating planned revisions where appropriate to improve clarity and verifiability.

  1. Referee: Abstract: The abstract states performance numbers and a nonconvex optimization framing but supplies no concrete algorithm steps, baseline comparisons, statistical tests, or error analysis, so the data cannot be checked for support of the central claims of higher ASR and lower cost.

    Authors: We acknowledge that the abstract's brevity limits the inclusion of full details. In the revised manuscript, we will expand the abstract to include a concise outline of the iterative multi-LLM coordination steps and explicitly name the primary baselines and datasets. Full baseline comparisons, statistical tests (including means, standard deviations, and significance levels), and error analysis are presented in Sections 4 and 5 with supporting tables; we will add a cross-reference in the abstract to direct readers to these sections for verification of the ASR and cost claims. revision: yes

  2. Referee: Method section (iterative coordination loop): The central cross-architecture generalization claim depends on the iterative multi-LLM loop producing prompts without undisclosed model-specific tuning of free parameters such as iteration count and coordination schedule; if effectiveness arises from implicit adaptation to target refusal patterns rather than the nonconvex solver, the reported gains and stealth would not transfer, undermining the claim.

    Authors: Section 3.2 of the manuscript states that hyperparameters including iteration count and coordination schedule are fixed values determined on a held-out validation set and applied uniformly across all target models and architectures with no per-model tuning. This fixed-parameter design supports the reported cross-architecture generalization. To further address the concern, we will add an ablation study in the revision showing performance stability across a range of iteration counts and schedules, confirming that gains derive from the nonconvex optimization and multi-LLM coordination rather than implicit adaptation. We maintain that no undisclosed model-specific tuning occurs. revision: partial

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

The abstract and available text present THREAT as an empirical framework that formulates prompt search as nonconvex optimization and reports measured attack success rates, runtime, and detection rates across datasets and models. No equations, fitted parameters, or self-citation chains are exhibited that would reduce the reported performance metrics to inputs by construction. The central claims rest on experimental comparisons rather than any definitional equivalence or renamed known result. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that iterative LLM coordination can solve the stated nonconvex prompt-discovery problem more efficiently than existing single-model or gradient-based attacks, plus the implicit premise that the tested models are representative of aligned LLMs in general.

free parameters (1)
  • iteration count and coordination schedule
    The iterative search loop requires at least one tunable parameter for number of rounds or model hand-off rules that is not specified in the abstract.
axioms (1)
  • domain assumption: Multiple LLMs can be prompted to improve each other's jailbreak attempts without the process being blocked by their own safety filters.
    The framework description presupposes that the participating models remain cooperative throughout the search loop.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 6 internal anchors

  1. [1]

    AI, M.: Mixtral of experts (2023), https://mistral.ai/news/mixtral-of-experts/, accessed: 2025-06-06

  2. [2]

    Andriushchenko, M., Croce, F., Flammarion, N.: Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151 (2024)

  3. [3]

    Bhardwaj, R., Poria, S.: Red-teaming large language models using chain of utterances for safety-alignment (2023)

  4. [4]

    Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al.: Deepseek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)

  5. [5]

    Cao, B., Cao, Y., Lin, L., Chen, J.: Defending against alignment-breaking attacks via robustly aligned llm. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 10542–10560 (2024)

  6. [6]

    Chang, Z., Li, M., Liu, Y., Wang, J., Wang, Q., Liu, Y.: Play guessing game with LLM: Indirect jailbreak attack with implicit clues. In: Findings of the Association for Computational Linguistics (ACL). pp. 5135–5147 (2024)

  7. [7]

    Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E.: Jailbreaking black box large language models in twenty queries. In: IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). pp. 23–42 (2025)

  8. [8]

    Chen, Y., Li, H., Li, Y., Liu, Y., Song, Y., Hooi, B.: Topicattack: An indirect prompt injection attack via topic transition. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 7338–7356 (2025)

  9. [9]

    Chen, Y., Li, H., Zheng, Z., Wu, D., Song, Y., Hooi, B.: Defense against prompt injection attack by leveraging attack techniques. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 18331–18347 (2025)

  10. [10]

    Das, A.B., Sakib, S.K.: Breaking the shield: Vulnerabilities in content moderation for multimodal language models. Authorea Preprints (2025)

  11. [11]

    Das, A.B., Sakib, S.K., Ahmed, S.: Trustworthy medical imaging with large language models: A study of hallucinations across modalities. In: ICCV Workshop on Computer Vision for Automated Medical Diagnosis (2025)

  12. [12]

    Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., Liu, Y.: Masterkey: Automated jailbreaking of large language model chatbots. In: Network and Distributed System Security (NDSS) Symposium (2024)

  13. [13]

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proc. of North American chapter of the association for computational linguistics (NAACL): human language technologies. pp. 4171–4186 (2019)

  14. [14]

    Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y., Chen, J., Huang, S.: A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. arXiv preprint arXiv:2311.08268 (2023)

  15. [15]

    Dong, Y., Meng, X., Yu, N., Li, Z., Guo, S.: Fuzz-testing meets LLM-based agents: An automated and efficient framework for jailbreaking text-to-image generation models. In: IEEE Symposium on Security and Privacy (SP). pp. 373–391 (2025)

  16. [16]

    Ghosh, S., Varshney, P., Sreedhar, M.N., Padmakumar, A., Rebedea, T., Varghese, J.R., Parisien, C.: AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. In: Proc of Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)....

  17. [17]

    Gretel: Gretel synthetic safety alignment dataset (2024), https://huggingface.co/datasets/gretelai/gretel-safety-alignment-en-v1

  18. [18]

    Guo, X., Yu, F., Zhang, H., Qin, L., Hu, B.: Cold-attack: Jailbreaking LLMs with stealthiness and controllability. arXiv preprint arXiv:2402.08679 (2024)

  19. [19]

    Hayase, J., Borevković, E., Carlini, N., Tramèr, F., Nasr, M.: Query-based adversarial prompt generation. Advances in Neural Information Processing Systems 37, 128260–128279 (2024)

  20. [20]

    Hsu, C.Y., Tsai, Y.L., Lin, C.H., Chen, P.Y., Yu, C.M., Huang, C.Y.: Safe lora: The silver lining of reducing safety risks when finetuning large language models. Advances in Neural Information Processing Systems 37, 65072–65094 (2024)

  21. [21]

    Hu, W., Li, H., Jing, H., Hu, Q., Zeng, Z., Han, S., Heli, X., Chu, T., Hu, P., Song, Y.: Context reasoner: Incentivizing reasoning capability for contextualized privacy and safety compliance via reinforcement learning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 865–883 (2025)

  22. [22]

    Huang, D., Shah, A., Araujo, A., Wagner, D., Sitawarin, C.: Stronger universal and transferable attacks by suppressing refusals. In: Proc. of Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 5850–5876 (2025)

  23. [23]

    Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., et al.: Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems (NeurIPS) 36, 24678–24704 (2023)

  24. [24]

    Jia, X., Pang, T., Du, C., Huang, Y., Gu, J., Liu, Y., Cao, X., Lin, M.: Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018 (2024)

  25. [25]

    Jiang, F., Xu, Z., Niu, L., Xiang, Z., Ramasubramanian, B., Li, B., Poovendran, R.: ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 15157–15173 (2024)

  26. [26]

    Kang, M., Chen, Z., Li, B.: C-SafeGen: Certified safe LLM generation with claim-based streaming guardrails. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  27. [27]

    Kumar, A., Agarwal, C., Srinivas, S., Li, A.J., Feizi, S., Lakkaraju, H.: Certifying LLM safety against adversarial prompting. arXiv preprint arXiv:2309.02705 (2023)

  28. [28]

    Lermen, S., Rogers-Smith, C., Ladish, J.: LoRA fine-tuning efficiently undoes safety training in LLaMA 2-chat 70b. arXiv preprint arXiv:2310.20624 (2023)

  29. [29]

    Li, X., Wang, R., Cheng, M., Zhou, T., Hsieh, C.J.: Drattack: Prompt decomposition and reconstruction makes powerful LLM jailbreakers. arXiv preprint arXiv:2402.16914 (2024)

  30. [30]

    Li, X., Zhou, Z., Zhu, J., Yao, J., Liu, T., Han, B.: DeepInception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191 (2023)

  31. [31]

    Liu, S., Liu, H., Liu, A., Bingchen, D., Qi, Z., Yan, Y., Geng, H., Jiang, P., Liu, J., Hu, X.: A survey on proactive defense strategies against misinformation in large language models. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 18144–18155 (2025)

  32. [32]

    Liu, T., Zhang, Y., Zhao, Z., Dong, Y., Meng, G., Chen, K.: Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction. In: 33rd USENIX Security Symposium (USENIX Security 24). pp. 4711–4728 (2024)

  33. [33]

    Liu, X., Li, P., Suh, E., Vorobeychik, Y., Mao, Z., Jha, S., McDaniel, P., Sun, H., Li, B., Xiao, C.: Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. arXiv preprint arXiv:2410.05295 (2024)

  34. [34]

    Liu, X., Xu, N., Chen, M., Xiao, C.: Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 (2023)

  35. [35]

    Maini, P., Goyal, S., Sam, D., Robey, A., Savani, Y., Jiang, Y., Zou, A., Fredrikson, M., Lipton, Z.C., Kolter, J.Z.: Safety pretraining: Toward the next generation of safe AI. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  36. [36]

    Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al.: Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)

  37. [37]

    Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A.: Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems 37, 61065–61105 (2024)

  38. [38]

    Mehta, M., Giunchiglia, F.: Understanding gen alpha’s digital language: Evaluation of LLM safety systems for content moderation. In: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. pp. 2863–2873 (2025)

  39. [39]

    Menéndez, M.L., Pardo, J.A., Pardo, L., Pardo, M.d.C.: The Jensen-Shannon divergence. Journal of the Franklin Institute 334(2), 307–318 (1997)

  40. [40]

    OpenAI: GPT-4o model documentation. https://platform.openai.com/docs/models/gpt-4o (2024), accessed: 2025-06-04

  41. [41]

    Qi, X., Zeng, Y., Xie, T., Chen, P.Y., Jia, R., Mittal, P., Henderson, P.: Fine-tuning aligned language models compromises safety, even when users do not intend to! In: 12th International Conference on Learning Representations, ICLR 2024 (2024)

  42. [42]

    Rath, P., Shrawgi, H., Agrawal, P., Dandapat, S.: LLM safety for children. In: Proc of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track). pp. 809–821 (2025)

  43. [43]

    Sabbaghi, M., Kassianik, P., Pappas, G., Singer, Y., Karbasi, A., Hassani, H.: Adversarial reasoning at jailbreaking time. arXiv preprint arXiv:2502.01633 (2025)

  44. [44]

    Sakib, S.: Threat: Targeted harmful generation via reframing and exploitation of adversarial tactics (2026), https://github.com/Shahanewaz/THREAT_Codes

  45. [45]

    Sakib, S.K., Ahmed, S., Das, A.B.: An iterative framework for controlled attribute transformation using multimodal models. In: accepted IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) (2026)

  46. [46]

    Sakib, S.K., Das, A.B.: Challenging fairness: A comprehensive exploration of bias in LLM-based recommendations. In: IEEE International Conference on Big Data (BigData). pp. 1585–1592 (2024)

  47. [47]

    Sakib, S.K., Das, A.B., Ahmed, S.: Battling misinformation: An empirical study on adversarial factuality in open-source large language models. In: Trustworthy Natural Language Processing (TrustNLP) Colocated with NAACL (2025)

  48. [48]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  49. [49]

    Tunstall, L., Beeching, E.E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., Von Werra, L., Fourrier, C., Habib, N., et al.: Zephyr: Direct distillation of lm alignment. In: First Conference on Language Modeling

  50. [50]

    Wang, J., Li, J., Li, Y., Qi, X., Hu, J., Li, S., McDaniel, P., Chen, M., Li, B., Xiao, C.: Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. Advances in Neural Information Processing Systems 37, 5210–5243 (2024)

  51. [51]

    Wang, X., Wu, D., Ji, Z., Li, Z., Ma, P., Wang, S., Li, Y., Liu, Y., Liu, N., Rahmel, J.: SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner. In: 34th USENIX Security Symposium. pp. 2441–2460 (2025)

  52. [52]

    Wen, Y., Jain, N., Kirchenbauer, J., Goldblum, M., Geiping, J., Goldstein, T.: Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Advances in Neural Information Processing Systems 36, 51008–51025 (2023)

  53. [53]

    Xiao, Z., Yang, Y., Chen, G., Chen, Y.: Distract large language models for automatic jailbreak attack. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 16230–16244 (2024)

  54. [54]

    Xu, H., Wang, S., Li, N., Wang, K., Zhao, Y., Chen, K., Yu, T., Liu, Y., Wang, H.: Large language models for cyber security: A systematic literature review. ACM Transactions on Software Engineering and Methodology (2024)

  55. [55]

    Yang, K., Tao, G., Chen, X., Xu, J.: Alleviating the fear of losing alignment in LLM fine-tuning. In: IEEE Symposium on Security and Privacy (S&P). pp. 2152–2170. IEEE (2025)

  56. [56]

    Yang, K., Tao, G., Chen, X., Xu, J.: Alleviating the fear of losing alignment in LLM fine-tuning. In: IEEE Symposium on Security and Privacy (S&P). pp. 2152–2170 (2025). https://doi.org/10.1109/SP61157.2025.00171

  57. [57]

    Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W.Y., Zhao, X., Lin, D.: Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949 (2023)

  58. [58]

    Yao, D., Zhang, J., Harris, I.G., Carlsson, M.: FuzzLLM: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4485–4489 (2024)

  59. [59]

    Yuan, Y., Jiao, W., Wang, W., Huang, J.t., He, P., et al.: Gpt-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463 (2023)

  60. [60]

    Zeng, C., Tang, S., Yang, X., Chen, Y., Sun, Y., Xu, Z., Li, Y., Chen, H., Cheng, W., Xu, D.D.: Dald: Improving logits-based detector without logits from black-box LLMs. Advances in Neural Information Processing Systems 37, 54947–54973 (2024)

  61. [61]

    Zhang, J., Wang, X., Ren, W., Jiang, L., Wang, D., Liu, K.: RATT: A thought structure for coherent and correct llm reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 26733–26741 (2025)

  62. [62]

    Zhang, T., Cao, B., Cao, Y., Lin, L., et al.: WordGame: Efficient & effective LLM jailbreak via simultaneous obfuscation in query and response. In: Findings of the Association for Computational Linguistics: NAACL 2025. pp. 4779–4807 (2025)

  63. [63]

    Zhang, Y., Yu, T., Tian, H., Fu, C., Li, P., Zeng, J., Xie, W., Shi, Y., Zhang, H., Wu, J., et al.: MM-RLHF: The next step forward in multimodal LLM alignment. In: International Conference on Machine Learning. pp. 76625–76654. PMLR (2025)

  64. [64]

    Zhou, Y., Huang, Z., Lu, F., Qin, Z., Wang, W.: Don’t say no: Jailbreaking LLM by suppressing refusal. arXiv preprint arXiv:2404.16369 (2024)

  65. [65]

    Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Kolter, J.Z., Fredrikson, M., Hendrycks, D.: Improving alignment and robustness with circuit breakers. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

  66. [66]

    Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)