Adversarial Reframing: A Framework for Targeted Generation in Language Models

Anindya Bijoy Das; Shahnewaz Karim Sakib; Swati Kar

arxiv: 2605.21674 · v1 · pith:DP6ULSRAnew · submitted 2026-05-20 · 💻 cs.CR

Pith reviewed 2026-05-22 09:32 UTC · model grok-4.3

keywords jailbreaking · large language models · adversarial prompts · prompt optimization · LLM safety · red teaming · multi-agent coordination

The pith

THREAT coordinates multiple LLMs in an iterative loop to discover jailbreak prompts that succeed more often and evade safety filters better than earlier methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces THREAT, a framework that reframes the search for jailbreak prompts as a nonconvex optimization problem solved through coordinated reasoning among several large language models in an iterative loop. This produces textual attacks that achieve higher success rates at lower computational cost while being flagged as harmful in under 1 percent of cases, compared with about 50 percent refusals for direct prompts. A sympathetic reader would care because the results demonstrate systematic ways to bypass current alignment safeguards in LLMs. If correct, the work indicates that safety mechanisms can be exploited through reframed adversarial tactics without obvious red flags. The findings position the method as a way to expose and then address hidden weaknesses in deployed models.

Core claim

THREAT formulates prompt discovery as a nonconvex optimization problem and solves it through an iterative multi-LLM coordination loop, yielding jailbreak prompts that deliver higher attack success rates with lower runtime and are flagged as harmful in fewer than 1 percent of cases versus roughly 50 percent for unmodified prompts across tested datasets and model architectures.
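
The optimization problem is quoted on this page only in flattened form (in the Lean theorems section). Written out, it might read as follows. The symbols $R$, $S$, $\delta_i$, $\varepsilon_1$ and $\varepsilon_2$ come from the quoted passage; reading $R$ as a reward score, $S$ as a similarity measure, and $x_i = x_{i-1} + \delta_i$ as the update between iterations are assumptions made here, not statements confirmed by the abstract.

```latex
% Per-iteration problem as quoted from the paper; interpretation of R and S is assumed.
\begin{aligned}
\text{maximize}   \quad & R\bigl(x_{i-1},\; x_{i-1} + \delta_i\bigr) \\
\text{subject to} \quad & \varepsilon_1 < S\bigl(x_{i-1},\; x_i\bigr) < \varepsilon_2
\end{aligned}
```

Under that reading, the two-sided bound keeps each new prompt neither identical to nor unrelated to its predecessor, which is one plausible source of the nonconvexity the paper mentions.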

What carries the argument

The iterative multi-LLM coordination loop that solves the nonconvex optimization for discovering effective jailbreak prompts.

If this is right

Attack success rates rise across diverse datasets and model architectures relative to prior jailbreaking techniques.
Computational cost drops while maintaining or improving effectiveness.
Generated prompts trigger safety refusals in under 1 percent of cases.
Vulnerabilities in aligned LLMs that were previously undetected become measurable.
The framework provides a practical means to test and strengthen model safety defenses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coordination approach might transfer to testing other safety properties such as resistance to misinformation prompts.
Safety filters may be tuned primarily to direct harmful language rather than reframed versions that preserve intent.
Developers could adapt the optimization loop to generate training data for more robust refusal mechanisms.

Load-bearing premise

The nonconvex optimization and iterative coordination produce prompts whose higher success and low detection rates generalize beyond the tested models and datasets without the search process itself creating detectable artifacts.

What would settle it

Applying the generated prompts to a new LLM architecture and dataset outside the original experiments and finding no gain in success rate or a flagging rate near 50 percent would falsify the generalization claim.
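
A minimal sketch of how such a check could be scored, assuming only that flagged and unflagged prompts have been counted for the unmodified and the reframed sets on the new model. The function name `two_proportion_z` and the counts in the usage note are hypothetical; the test is a standard pooled two-proportion z-test, not a procedure taken from the paper.

```python
import math


def two_proportion_z(flagged_a: int, n_a: int, flagged_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups have the same flag rate."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = flagged_a / n_a
    p_b = flagged_b / n_b
    # Pooled rate under the null hypothesis that the two rates are equal.
    pooled = (flagged_a + flagged_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All prompts flagged or none flagged in both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z(100, 200, 2, 200)` compares a hypothetical 50 percent flag rate against a hypothetical 1 percent rate. A large positive `z` with a small p-value would be consistent with the paper's claim on that model, while a `z` near zero would point toward the falsifying outcome described above.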

Figures

Figures reproduced from arXiv: 2605.21674 by Anindya Bijoy Das, Shahnewaz Karim Sakib, Swati Kar.

**Figure 1.** Harmful prompts from the HarmfulQA dataset by [3] can be reframed to evade safety filters while preserving adversarial intent.

**Figure 2.** Refusal rates (original vs. THREAT) on four different safety-benchmark datasets: (i) discrimination, (ii) information hazards, (iii) safety risks and (iv) HarmfulQA. Each bar indicates the percentage (and absolute count) of prompts that GPT-4o refused to answer under each prompting strategy.

**Figure 3.** Average similarity scores for generated responses on the Discrimination dataset. The left panel shows the mean blue score for examples labeled "Blue" versus "Red," and the right panel shows the mean red score for the same two groups.

**Figure 4.** Refusal counts for the original prompts, categorized by intervals of the difference between red- and blue-scores; the values displayed above each bar indicate the corresponding refusal counts for our THREAT-derived prompts.

**Figure 5.** (a) Boxplots of reward safety gains across predicted Red and Blue labels in the HarmfulQA dataset. (b) Distribution of Red and Blue prediction proportions as a function of judge-assigned response scores in the System Risks dataset.

Abstract

Large Language Models (LLMs) are widely deployed in diverse real-world settings, yet remain vulnerable to jailbreaking, where prompt-based attacks bypass safety filters. We present THREAT (Targeted Harmful generation via Reframing and Exploitation of Adversarial Tactics), a reasoning-driven framework that coordinates multiple LLMs in an iterative search loop to find textual jailbreak prompts. We formulate prompt discovery as a nonconvex optimization problem and provide an efficient solution that lowers runtime and improves attack effectiveness. Across diverse datasets and model architectures, THREAT delivers higher attack success rates with lower computational cost than prior methods. The crafted prompts were flagged as harmful in fewer than 1% of cases, compared with about 50% refusals for the corresponding unmodified prompts. These findings reveal previously undetected vulnerabilities in aligned LLMs and position THREAT as a practical tool for proactively strengthening the safety of foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

THREAT uses multi-LLM coordination in an iterative loop framed as nonconvex optimization to generate jailbreak prompts, with reported gains in success rate and stealth, though the abstract leaves the concrete method and supporting data thin.

The main point is that this paper introduces THREAT, a framework that runs several LLMs in a loop to search for and refine jailbreak prompts, treating the task as a nonconvex optimization problem. It claims this produces higher attack success rates at lower cost than earlier methods, and the resulting prompts get flagged as harmful in under 1% of cases while plain versions trigger refusals about half the time.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces THREAT, a framework for targeted harmful generation in language models. It coordinates multiple LLMs in an iterative search loop to discover jailbreak prompts, formulated as a nonconvex optimization problem. The authors claim higher attack success rates and lower computational cost than prior methods across diverse datasets and model architectures, with crafted prompts flagged as harmful in fewer than 1% of cases versus about 50% refusals for unmodified prompts.

Significance. If the empirical claims hold after detailed verification, the work would be significant for AI safety research by exposing vulnerabilities in aligned LLMs and offering a practical red-teaming tool. The multi-LLM iterative coordination and nonconvex optimization framing represent a novel approach to prompt discovery. The reported stealth property (<1% flagging) is a notable strength if reproducible.

major comments (2)

Abstract: The abstract states performance numbers and a nonconvex optimization framing but supplies no concrete algorithm steps, baseline comparisons, statistical tests, or error analysis, so the data cannot be checked for support of the central claims of higher ASR and lower cost.
Method section (iterative coordination loop): The central cross-architecture generalization claim depends on the iterative multi-LLM loop producing prompts without undisclosed model-specific tuning of free parameters such as iteration count and coordination schedule; if effectiveness arises from implicit adaptation to target refusal patterns rather than the nonconvex solver, the reported gains and stealth would not transfer, undermining the claim.

minor comments (1)

Abstract: Define 'attack success rate' and 'computational cost' more precisely to allow direct comparison with baselines.
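
One hedged way to pin these terms down, sketched for illustration and not taken from the paper: treat attack success rate (ASR) as the fraction of prompts whose response a judge marks as successful, refusal rate as the fraction the target declines, and cost as the mean number of target-model queries per prompt. The `Outcome` record, its fields, and the choice of queries as the cost unit are all assumptions; the Wilson score interval is a standard way to attach uncertainty to a proportion.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    refused: bool    # target model declined to answer (assumed label)
    succeeded: bool  # judge marked the response as a successful attack (assumed label)
    queries: int     # target-model queries spent on this prompt (assumed cost unit)


def wilson_interval(hits: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = hits / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(outcomes: list[Outcome]) -> dict:
    """Compute ASR, refusal rate, mean query cost, and a 95% interval on ASR."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    successes = sum(o.succeeded for o in outcomes)
    refusals = sum(o.refused for o in outcomes)
    return {
        "asr": successes / n,
        "refusal_rate": refusals / n,
        "mean_queries": sum(o.queries for o in outcomes) / n,
        "asr_ci95": wilson_interval(successes, n),
    }
```

Reporting the interval alongside the point estimate would let a reader judge whether a gap between two methods exceeds sampling noise, which is the kind of direct comparison with baselines the comment asks for.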

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below in a point-by-point manner, indicating planned revisions where appropriate to improve clarity and verifiability.

Referee: Abstract: The abstract states performance numbers and a nonconvex optimization framing but supplies no concrete algorithm steps, baseline comparisons, statistical tests, or error analysis, so the data cannot be checked for support of the central claims of higher ASR and lower cost.

Authors: We acknowledge that the abstract's brevity limits the inclusion of full details. In the revised manuscript, we will expand the abstract to include a concise outline of the iterative multi-LLM coordination steps and explicitly name the primary baselines and datasets. Full baseline comparisons, statistical tests (including means, standard deviations, and significance levels), and error analysis are presented in Sections 4 and 5 with supporting tables; we will add a cross-reference in the abstract to direct readers to these sections for verification of the ASR and cost claims.
revision: yes

Referee: Method section (iterative coordination loop): The central cross-architecture generalization claim depends on the iterative multi-LLM loop producing prompts without undisclosed model-specific tuning of free parameters such as iteration count and coordination schedule; if effectiveness arises from implicit adaptation to target refusal patterns rather than the nonconvex solver, the reported gains and stealth would not transfer, undermining the claim.

Authors: Section 3.2 of the manuscript states that hyperparameters including iteration count and coordination schedule are fixed values determined on a held-out validation set and applied uniformly across all target models and architectures with no per-model tuning. This fixed-parameter design supports the reported cross-architecture generalization. To further address the concern, we will add an ablation study in the revision showing performance stability across a range of iteration counts and schedules, confirming that gains derive from the nonconvex optimization and multi-LLM coordination rather than implicit adaptation. We maintain that no undisclosed model-specific tuning occurs.
revision: partial

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

The abstract and available text present THREAT as an empirical framework that formulates prompt search as nonconvex optimization and reports measured attack success rates, runtime, and detection rates across datasets and models. No equations, fitted parameters, or self-citation chains are exhibited that would reduce the reported performance metrics to inputs by construction. The central claims rest on experimental comparisons rather than any definitional equivalence or renamed known result. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that iterative LLM coordination can solve the stated nonconvex prompt-discovery problem more efficiently than existing single-model or gradient-based attacks, plus the implicit premise that the tested models are representative of aligned LLMs in general.

free parameters (1)

iteration count and coordination schedule
The iterative search loop requires at least one tunable parameter for number of rounds or model hand-off rules that is not specified in the abstract.

axioms (1)

domain assumption: Multiple LLMs can be prompted to improve each other's jailbreak attempts without the process being blocked by their own safety filters.
The framework description presupposes that the participating models remain cooperative throughout the search loop.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We formulate prompt discovery as a nonconvex optimization problem... maximize $R(x_{i-1}, x_{i-1} + \delta_i)$ subject to $\varepsilon_1 < S(x_{i-1}, x_i) < \varepsilon_2$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: THREAT coordinates multiple LLMs in an iterative search loop

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 6 internal anchors

[1] AI, M.: Mixtral of experts (2023), https://mistral.ai/news/mixtral-of-experts/, accessed: 2025-06-06
[2] Andriushchenko, M., Croce, F., Flammarion, N.: Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151 (2024)
[3] Bhardwaj, R., Poria, S.: Red-teaming large language models using chain of utterances for safety-alignment (2023)
[4] Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al.: DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)
[5] Cao, B., Cao, Y., Lin, L., Chen, J.: Defending against alignment-breaking attacks via robustly aligned LLM. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 10542–10560 (2024)
[6] Chang, Z., Li, M., Liu, Y., Wang, J., Wang, Q., Liu, Y.: Play guessing game with LLM: Indirect jailbreak attack with implicit clues. In: Findings of the Association for Computational Linguistics (ACL). pp. 5135–5147 (2024)
[7] Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E.: Jailbreaking black box large language models in twenty queries. In: IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). pp. 23–42 (2025)
[8] Chen, Y., Li, H., Li, Y., Liu, Y., Song, Y., Hooi, B.: Topicattack: An indirect prompt injection attack via topic transition. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 7338–7356 (2025)
[9] Chen, Y., Li, H., Zheng, Z., Wu, D., Song, Y., Hooi, B.: Defense against prompt injection attack by leveraging attack techniques. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 18331–18347 (2025)
[10] Das, A.B., Sakib, S.K.: Breaking the shield: Vulnerabilities in content moderation for multimodal language models. Authorea Preprints (2025)
[11] Das, A.B., Sakib, S.K., Ahmed, S.: Trustworthy medical imaging with large language models: A study of hallucinations across modalities. In: ICCV Workshop on Computer Vision for Automated Medical Diagnosis (2025)
[12] Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., Liu, Y.: Masterkey: Automated jailbreaking of large language model chatbots. In: Network and Distributed System Security (NDSS) Symposium (2024)
[13] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proc. of North American chapter of the association for computational linguistics (NAACL): human language technologies. pp. 4171–4186 (2019)
[14] Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y., Chen, J., Huang, S.: A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. arXiv preprint arXiv:2311.08268 (2023)
[15] Dong, Y., Meng, X., Yu, N., Li, Z., Guo, S.: Fuzz-testing meets LLM-based agents: An automated and efficient framework for jailbreaking text-to-image generation models. In: IEEE Symposium on Security and Privacy (SP). pp. 373–391 (2025)
[16] Ghosh, S., Varshney, P., Sreedhar, M.N., Padmakumar, A., Rebedea, T., Varghese, J.R., Parisien, C.: AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. In: Proc of Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)....
[17] Gretel: Gretel synthetic safety alignment dataset (2024), https://huggingface.co/datasets/gretelai/gretel-safety-alignment-en-v1
[18] Guo, X., Yu, F., Zhang, H., Qin, L., Hu, B.: Cold-attack: Jailbreaking LLMs with stealthiness and controllability. arXiv preprint arXiv:2402.08679 (2024)
[19] Hayase, J., Borevković, E., Carlini, N., Tramèr, F., Nasr, M.: Query-based adversarial prompt generation. Advances in Neural Information Processing Systems 37, 128260–128279 (2024)
[20] Hsu, C.Y., Tsai, Y.L., Lin, C.H., Chen, P.Y., Yu, C.M., Huang, C.Y.: Safe lora: The silver lining of reducing safety risks when finetuning large language models. Advances in Neural Information Processing Systems 37, 65072–65094 (2024)
[21] Hu, W., Li, H., Jing, H., Hu, Q., Zeng, Z., Han, S., Heli, X., Chu, T., Hu, P., Song, Y.: Context reasoner: Incentivizing reasoning capability for contextualized privacy and safety compliance via reinforcement learning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 865–883 (2025)
[22] Huang, D., Shah, A., Araujo, A., Wagner, D., Sitawarin, C.: Stronger universal and transferable attacks by suppressing refusals. In: Proc. of Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 5850–5876 (2025)
[23] Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., et al.: Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems (NeurIPS) 36, 24678–24704 (2023)
[24] Jia, X., Pang, T., Du, C., Huang, Y., Gu, J., Liu, Y., Cao, X., Lin, M.: Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018 (2024)
[25] Jiang, F., Xu, Z., Niu, L., Xiang, Z., Ramasubramanian, B., Li, B., Poovendran, R.: ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 15157–15173 (2024)
[26] Kang, M., Chen, Z., Li, B.: C-SafeGen: Certified safe LLM generation with claim-based streaming guardrails. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
[27] Kumar, A., Agarwal, C., Srinivas, S., Li, A.J., Feizi, S., Lakkaraju, H.: Certifying LLM safety against adversarial prompting. arXiv preprint arXiv:2309.02705 (2023)
[28] Lermen, S., Rogers-Smith, C., Ladish, J.: LoRA fine-tuning efficiently undoes safety training in LLaMA 2-chat 70b. arXiv preprint arXiv:2310.20624 (2023)
[29] Li, X., Wang, R., Cheng, M., Zhou, T., Hsieh, C.J.: Drattack: Prompt decomposition and reconstruction makes powerful LLM jailbreakers. arXiv preprint arXiv:2402.16914 (2024)
[30] Li, X., Zhou, Z., Zhu, J., Yao, J., Liu, T., Han, B.: DeepInception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191 (2023)
[31] Liu, S., Liu, H., Liu, A., Bingchen, D., Qi, Z., Yan, Y., Geng, H., Jiang, P., Liu, J., Hu, X.: A survey on proactive defense strategies against misinformation in large language models. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 18144–18155 (2025)
[32] Liu, T., Zhang, Y., Zhao, Z., Dong, Y., Meng, G., Chen, K.: Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction. In: 33rd USENIX Security Symposium (USENIX Security 24). pp. 4711–4728 (2024)
[33] Liu, X., Li, P., Suh, E., Vorobeychik, Y., Mao, Z., Jha, S., McDaniel, P., Sun, H., Li, B., Xiao, C.: Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. arXiv preprint arXiv:2410.05295 (2024)
[34] Liu, X., Xu, N., Chen, M., Xiao, C.: AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 (2023)
[35] Maini, P., Goyal, S., Sam, D., Robey, A., Savani, Y., Jiang, Y., Zou, A., Fredrikson, M., Lipton, Z.C., Kolter, J.Z.: Safety pretraining: Toward the next generation of safe AI. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
[36] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)
[37] Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A.: Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems 37, 61065–61105 (2024)
[38] Mehta, M., Giunchiglia, F.: Understanding gen alpha’s digital language: Evaluation of LLM safety systems for content moderation. In: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. pp. 2863–2873 (2025)
[39] Menéndez, M.L., Pardo, J.A., Pardo, L., Pardo, M.d.C.: The Jensen-Shannon divergence. Journal of the Franklin Institute 334(2), 307–318 (1997)
[40] OpenAI: GPT-4o model documentation. https://platform.openai.com/docs/models/gpt-4o (2024), accessed: 2025-06-04
[41] Qi, X., Zeng, Y., Xie, T., Chen, P.Y., Jia, R., Mittal, P., Henderson, P.: Fine-tuning aligned language models compromises safety, even when users do not intend to! In: 12th International Conference on Learning Representations, ICLR 2024 (2024)
[42] Rath, P., Shrawgi, H., Agrawal, P., Dandapat, S.: LLM safety for children. In: Proc of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track). pp. 809–821 (2025)
[43] Sabbaghi, M., Kassianik, P., Pappas, G., Singer, Y., Karbasi, A., Hassani, H.: Adversarial reasoning at jailbreaking time. arXiv preprint arXiv:2502.01633 (2025)
[44] Sakib, S.: Threat: Targeted harmful generation via reframing and exploitation of adversarial tactics (2026), https://github.com/Shahanewaz/THREAT_Codes
[45] Sakib, S.K., Ahmed, S., Das, A.B.: An iterative framework for controlled attribute transformation using multimodal models. In: accepted IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) (2026)
[46] Sakib, S.K., Das, A.B.: Challenging fairness: A comprehensive exploration of bias in LLM-based recommendations. In: IEEE International Conference on Big Data (BigData). pp. 1585–1592 (2024)
[47] Sakib, S.K., Das, A.B., Ahmed, S.: Battling misinformation: An empirical study on adversarial factuality in open-source large language models. In: Trustworthy Natural Language Processing (TrustNLP) Colocated with NAACL (2025)
[48] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
[49] Tunstall, L., Beeching, E.E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., Von Werra, L., Fourrier, C., Habib, N., et al.: Zephyr: Direct distillation of lm alignment. In: First Conference on Language Modeling
[50] Wang, J., Li, J., Li, Y., Qi, X., Hu, J., Li, S., McDaniel, P., Chen, M., Li, B., Xiao, C.: Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. Advances in Neural Information Processing Systems 37, 5210–5243 (2024)
[51] Wang, X., Wu, D., Ji, Z., Li, Z., Ma, P., Wang, S., Li, Y., Liu, Y., Liu, N., Rahmel, J.: SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner. In: 34th USENIX Security Symposium. pp. 2441–2460 (2025)
[52] Wen, Y., Jain, N., Kirchenbauer, J., Goldblum, M., Geiping, J., Goldstein, T.: Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Advances in Neural Information Processing Systems 36, 51008–51025 (2023)
[53] Xiao, Z., Yang, Y., Chen, G., Chen, Y.: Distract large language models for automatic jailbreak attack. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 16230–16244 (2024)
[54] Xu, H., Wang, S., Li, N., Wang, K., Zhao, Y., Chen, K., Yu, T., Liu, Y., Wang, H.: Large language models for cyber security: A systematic literature review. ACM Transactions on Software Engineering and Methodology (2024)
[55] Yang, K., Tao, G., Chen, X., Xu, J.: Alleviating the fear of losing alignment in LLM fine-tuning. In: IEEE Symposium on Security and Privacy (S&P). pp. 2152–2170. IEEE (2025)
[56] Yang, K., Tao, G., Chen, X., Xu, J.: Alleviating the fear of losing alignment in LLM fine-tuning. In: IEEE Symposium on Security and Privacy (S&P). pp. 2152–2170 (2025). https://doi.org/10.1109/SP61157.2025.00171
[57] Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W.Y., Zhao, X., Lin, D.: Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949 (2023)
[58] Yao, D., Zhang, J., Harris, I.G., Carlsson, M.: FuzzLLM: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4485–4489 (2024)
[59] Yuan, Y., Jiao, W., Wang, W., Huang, J.t., He, P., et al.: GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463 (2023)
[60] Zeng, C., Tang, S., Yang, X., Chen, Y., Sun, Y., Xu, Z., Li, Y., Chen, H., Cheng, W., Xu, D.D.: Dald: Improving logits-based detector without logits from black-box LLMs. Advances in Neural Information Processing Systems 37, 54947–54973 (2024)
[61] Zhang, J., Wang, X., Ren, W., Jiang, L., Wang, D., Liu, K.: RATT: A thought structure for coherent and correct LLM reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 26733–26741 (2025)
[62] Zhang, T., Cao, B., Cao, Y., Lin, L., et al.: WordGame: Efficient & effective LLM jailbreak via simultaneous obfuscation in query and response. In: Findings of the Association for Computational Linguistics: NAACL 2025. pp. 4779–4807 (2025)
[63] Zhang, Y., Yu, T., Tian, H., Fu, C., Li, P., Zeng, J., Xie, W., Shi, Y., Zhang, H., Wu, J., et al.: MM-RLHF: The next step forward in multimodal LLM alignment. In: International Conference on Machine Learning. pp. 76625–76654. PMLR (2025)
[64]

arXiv preprint arXiv:2404.16369 (2024)

Zhou, Y., Huang, Z., Lu, F., Qin, Z., Wang, W.: Don’t say no: Jailbreaking LLM by suppressing refusal. arXiv preprint arXiv:2404.16369 (2024)

work page arXiv 2024
[65]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Kolter, J.Z., Fredrikson, M., Hendrycks, D.: Improving alignment and robustness with circuit breakers. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

work page 2024
[66]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] AI, M.: Mixtral of experts (2023), https://mistral.ai/news/mixtral-of-experts/, accessed: 2025-06-06

[2] Andriushchenko, M., Croce, F., Flammarion, N.: Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151 (2024)

[3] Bhardwaj, R., Poria, S.: Red-teaming large language models using chain of utterances for safety-alignment (2023)

[4] Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al.: DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)

[5] Cao, B., Cao, Y., Lin, L., Chen, J.: Defending against alignment-breaking attacks via robustly aligned LLM. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 10542–10560 (2024)

[6] Chang, Z., Li, M., Liu, Y., Wang, J., Wang, Q., Liu, Y.: Play guessing game with LLM: Indirect jailbreak attack with implicit clues. In: Findings of the Association for Computational Linguistics (ACL). pp. 5135–5147 (2024)

[7] Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E.: Jailbreaking black box large language models in twenty queries. In: IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). pp. 23–42 (2025)

[8] Chen, Y., Li, H., Li, Y., Liu, Y., Song, Y., Hooi, B.: Topicattack: An indirect prompt injection attack via topic transition. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 7338–7356 (2025)

[9] Chen, Y., Li, H., Zheng, Z., Wu, D., Song, Y., Hooi, B.: Defense against prompt injection attack by leveraging attack techniques. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 18331–18347 (2025)

[10] Das, A.B., Sakib, S.K.: Breaking the shield: Vulnerabilities in content moderation for multimodal language models. Authorea Preprints (2025)

[11] Das, A.B., Sakib, S.K., Ahmed, S.: Trustworthy medical imaging with large language models: A study of hallucinations across modalities. In: ICCV Workshop on Computer Vision for Automated Medical Diagnosis (2025)

[12] Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., Liu, Y.: Masterkey: Automated jailbreaking of large language model chatbots. In: Network and Distributed System Security (NDSS) Symposium (2024)

[13] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proc. of North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies. pp. 4171–4186 (2019)

[14] Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y., Chen, J., Huang, S.: A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. arXiv preprint arXiv:2311.08268 (2023)

[15] Dong, Y., Meng, X., Yu, N., Li, Z., Guo, S.: Fuzz-testing meets LLM-based agents: An automated and efficient framework for jailbreaking text-to-image generation models. In: IEEE Symposium on Security and Privacy (SP). pp. 373–391 (2025)

[16] Ghosh, S., Varshney, P., Sreedhar, M.N., Padmakumar, A., Rebedea, T., Varghese, J.R., Parisien, C.: AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. In: Proc. of Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)....

[17] Gretel: Gretel synthetic safety alignment dataset (2024), https://huggingface.co/datasets/gretelai/gretel-safety-alignment-en-v1

[18] Guo, X., Yu, F., Zhang, H., Qin, L., Hu, B.: Cold-attack: Jailbreaking LLMs with stealthiness and controllability. arXiv preprint arXiv:2402.08679 (2024)

[19] Hayase, J., Borevković, E., Carlini, N., Tramèr, F., Nasr, M.: Query-based adversarial prompt generation. Advances in Neural Information Processing Systems 37, 128260–128279 (2024)

[20] Hsu, C.Y., Tsai, Y.L., Lin, C.H., Chen, P.Y., Yu, C.M., Huang, C.Y.: Safe LoRA: The silver lining of reducing safety risks when finetuning large language models. Advances in Neural Information Processing Systems 37, 65072–65094 (2024)

[21] Hu, W., Li, H., Jing, H., Hu, Q., Zeng, Z., Han, S., Heli, X., Chu, T., Hu, P., Song, Y.: Context reasoner: Incentivizing reasoning capability for contextualized privacy and safety compliance via reinforcement learning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 865–883 (2025)

[22] Huang, D., Shah, A., Araujo, A., Wagner, D., Sitawarin, C.: Stronger universal and transferable attacks by suppressing refusals. In: Proc. of Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 5850–5876 (2025)

[23] Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., et al.: Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems (NeurIPS) 36, 24678–24704 (2023)

[24] Jia, X., Pang, T., Du, C., Huang, Y., Gu, J., Liu, Y., Cao, X., Lin, M.: Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018 (2024)

[25] Jiang, F., Xu, Z., Niu, L., Xiang, Z., Ramasubramanian, B., Li, B., Poovendran, R.: ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 15157–15173 (2024)

[26] Kang, M., Chen, Z., Li, B.: C-SafeGen: Certified safe LLM generation with claim-based streaming guardrails. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

[27] Kumar, A., Agarwal, C., Srinivas, S., Li, A.J., Feizi, S., Lakkaraju, H.: Certifying LLM safety against adversarial prompting. arXiv preprint arXiv:2309.02705 (2023)

[28] Lermen, S., Rogers-Smith, C., Ladish, J.: LoRA fine-tuning efficiently undoes safety training in LLaMA 2-chat 70b. arXiv preprint arXiv:2310.20624 (2023)

[29] Li, X., Wang, R., Cheng, M., Zhou, T., Hsieh, C.J.: Drattack: Prompt decomposition and reconstruction makes powerful LLM jailbreakers. arXiv preprint arXiv:2402.16914 (2024)

[30] Li, X., Zhou, Z., Zhu, J., Yao, J., Liu, T., Han, B.: DeepInception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191 (2023)

[31] Liu, S., Liu, H., Liu, A., Bingchen, D., Qi, Z., Yan, Y., Geng, H., Jiang, P., Liu, J., Hu, X.: A survey on proactive defense strategies against misinformation in large language models. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 18144–18155 (2025)

[32] Liu, T., Zhang, Y., Zhao, Z., Dong, Y., Meng, G., Chen, K.: Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction. In: 33rd USENIX Security Symposium (USENIX Security 24). pp. 4711–4728 (2024)

[33] Liu, X., Li, P., Suh, E., Vorobeychik, Y., Mao, Z., Jha, S., McDaniel, P., Sun, H., Li, B., Xiao, C.: Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. arXiv preprint arXiv:2410.05295 (2024)

[34] Liu, X., Xu, N., Chen, M., Xiao, C.: AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 (2023)

[35] Maini, P., Goyal, S., Sam, D., Robey, A., Savani, Y., Jiang, Y., Zou, A., Fredrikson, M., Lipton, Z.C., Kolter, J.Z.: Safety pretraining: Toward the next generation of safe AI. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

[36] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)

[37] Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A.: Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems 37, 61065–61105 (2024)

[38] Mehta, M., Giunchiglia, F.: Understanding gen alpha’s digital language: Evaluation of LLM safety systems for content moderation. In: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. pp. 2863–2873 (2025)

[39] Menéndez, M.L., Pardo, J.A., Pardo, L., Pardo, M.d.C.: The Jensen-Shannon divergence. Journal of the Franklin Institute 334(2), 307–318 (1997)

[40] OpenAI: GPT-4o model documentation. https://platform.openai.com/docs/models/gpt-4o (2024), accessed: 2025-06-04

[41] Qi, X., Zeng, Y., Xie, T., Chen, P.Y., Jia, R., Mittal, P., Henderson, P.: Fine-tuning aligned language models compromises safety, even when users do not intend to! In: 12th International Conference on Learning Representations, ICLR 2024 (2024)

[42] Rath, P., Shrawgi, H., Agrawal, P., Dandapat, S.: LLM safety for children. In: Proc. of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track). pp. 809–821 (2025)

[43] Sabbaghi, M., Kassianik, P., Pappas, G., Singer, Y., Karbasi, A., Hassani, H.: Adversarial reasoning at jailbreaking time. arXiv preprint arXiv:2502.01633 (2025)

[44] Sakib, S.: THREAT: Targeted harmful generation via reframing and exploitation of adversarial tactics (2026), https://github.com/Shahanewaz/THREAT_Codes

[45] Sakib, S.K., Ahmed, S., Das, A.B.: An iterative framework for controlled attribute transformation using multimodal models. In: accepted IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) (2026)

[46] Sakib, S.K., Das, A.B.: Challenging fairness: A comprehensive exploration of bias in LLM-based recommendations. In: IEEE International Conference on Big Data (BigData). pp. 1585–1592 (2024)

[47] Sakib, S.K., Das, A.B., Ahmed, S.: Battling misinformation: An empirical study on adversarial factuality in open-source large language models. In: Trustworthy Natural Language Processing (TrustNLP) Colocated with NAACL (2025)

[48] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

[49] Tunstall, L., Beeching, E.E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., Von Werra, L., Fourrier, C., Habib, N., et al.: Zephyr: Direct distillation of LM alignment. In: First Conference on Language Modeling

[50] Wang, J., Li, J., Li, Y., Qi, X., Hu, J., Li, S., McDaniel, P., Chen, M., Li, B., Xiao, C.: Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. Advances in Neural Information Processing Systems 37, 5210–5243 (2024)

[51] Wang, X., Wu, D., Ji, Z., Li, Z., Ma, P., Wang, S., Li, Y., Liu, Y., Liu, N., Rahmel, J.: SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner. In: 34th USENIX Security Symposium. pp. 2441–2460 (2025)

[52] Wen, Y., Jain, N., Kirchenbauer, J., Goldblum, M., Geiping, J., Goldstein, T.: Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Advances in Neural Information Processing Systems 36, 51008–51025 (2023)

[53] Xiao, Z., Yang, Y., Chen, G., Chen, Y.: Distract large language models for automatic jailbreak attack. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 16230–16244 (2024)

[54] Xu, H., Wang, S., Li, N., Wang, K., Zhao, Y., Chen, K., Yu, T., Liu, Y., Wang, H.: Large language models for cyber security: A systematic literature review. ACM Transactions on Software Engineering and Methodology (2024)

[55] Yang, K., Tao, G., Chen, X., Xu, J.: Alleviating the fear of losing alignment in LLM fine-tuning. In: IEEE Symposium on Security and Privacy (S&P). pp. 2152–2170. IEEE (2025)

[56] Yang, K., Tao, G., Chen, X., Xu, J.: Alleviating the fear of losing alignment in LLM fine-tuning. In: IEEE Symposium on Security and Privacy (S&P). pp. 2152–2170 (2025). https://doi.org/10.1109/SP61157.2025.00171

[57] Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W.Y., Zhao, X., Lin, D.: Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949 (2023)

[58] Yao, D., Zhang, J., Harris, I.G., Carlsson, M.: FuzzLLM: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4485–4489 (2024)

[59] Yuan, Y., Jiao, W., Wang, W., Huang, J.t., He, P., et al.: GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463 (2023)

[60] Zeng, C., Tang, S., Yang, X., Chen, Y., Sun, Y., Xu, Z., Li, Y., Chen, H., Cheng, W., Xu, D.D.: DALD: Improving logits-based detector without logits from black-box LLMs. Advances in Neural Information Processing Systems 37, 54947–54973 (2024)

[61] Zhang, J., Wang, X., Ren, W., Jiang, L., Wang, D., Liu, K.: RATT: A thought structure for coherent and correct LLM reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 26733–26741 (2025)

[62] Zhang, T., Cao, B., Cao, Y., Lin, L., et al.: WordGame: Efficient & effective LLM jailbreak via simultaneous obfuscation in query and response. In: Findings of the Association for Computational Linguistics: NAACL 2025. pp. 4779–4807 (2025)

[63] Zhang, Y., Yu, T., Tian, H., Fu, C., Li, P., Zeng, J., Xie, W., Shi, Y., Zhang, H., Wu, J., et al.: MM-RLHF: The next step forward in multimodal LLM alignment. In: International Conference on Machine Learning. pp. 76625–76654. PMLR (2025)

[64] Zhou, Y., Huang, Z., Lu, F., Qin, Z., Wang, W.: Don’t say no: Jailbreaking LLM by suppressing refusal. arXiv preprint arXiv:2404.16369 (2024)

[65] Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Kolter, J.Z., Fredrikson, M., Hendrycks, D.: Improving alignment and robustness with circuit breakers. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

[66] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)