Adversarial Reframing: A Framework for Targeted Generation in Language Models
Pith reviewed 2026-05-22 09:32 UTC · model grok-4.3
The pith
THREAT coordinates multiple LLMs in an iterative loop to discover jailbreak prompts that succeed more often and evade safety filters better than earlier methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
THREAT formulates prompt discovery as a nonconvex optimization problem and solves it through an iterative multi-LLM coordination loop. Across the tested datasets and model architectures, the resulting jailbreak prompts achieve higher attack success rates at lower runtime than prior methods, and they are flagged as harmful in fewer than 1 percent of cases, versus roughly 50 percent refusals for the corresponding unmodified prompts.
What carries the argument
The iterative multi-LLM coordination loop that solves the nonconvex optimization for discovering effective jailbreak prompts.
If this is right
- Attack success rates rise across diverse datasets and model architectures relative to prior jailbreaking techniques.
- Computational cost drops while maintaining or improving effectiveness.
- Generated prompts are flagged as harmful in under 1 percent of cases, against roughly 50 percent refusals for the unmodified prompts.
- Vulnerabilities in aligned LLMs that were previously undetected become measurable.
- The framework provides a practical means to test and strengthen model safety defenses.
Where Pith is reading between the lines
- The same coordination approach might transfer to testing other safety properties such as resistance to misinformation prompts.
- Safety filters may be tuned primarily to direct harmful language rather than reframed versions that preserve intent.
- Developers could adapt the optimization loop to generate training data for more robust refusal mechanisms.
Load-bearing premise
The nonconvex optimization and iterative coordination produce prompts whose higher success and low detection rates generalize beyond the tested models and datasets without the search process itself creating detectable artifacts.
What would settle it
Applying the generated prompts to a new LLM architecture and dataset outside the original experiments and finding no gain in success rate or a flagging rate near 50 percent would falsify the generalization claim.
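One way to make this test concrete is a two-proportion z-test on flagging counts. The sketch below is illustrative and is not taken from the paper: the function name `two_proportion_z` and the counts in the usage lines are hypothetical placeholders, and a pooled normal approximation is assumed to be adequate for the sample sizes involved.

```python
import math


def two_proportion_z(flagged_a, n_a, flagged_b, n_b):
    """Two-sided pooled two-proportion z-test.

    Returns (z, p_value) for the null hypothesis that both groups
    are flagged at the same underlying rate.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = flagged_a / n_a
    p_b = flagged_b / n_b
    # Pooled rate under the null hypothesis of equal flagging rates.
    pooled = (flagged_a + flagged_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-flagged or all-unflagged: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts on a held-out model: reframed prompts vs. unmodified prompts.
    z, p = two_proportion_z(flagged_a=48, n_a=100, flagged_b=50, n_b=100)
    print(f"z = {z:.3f}, p = {p:.3f}")
```

A large p-value on a model outside the original experiments would mean the reframed prompts are flagged about as often as the unmodified ones, which is the outcome that would count against the generalization claim.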
Original abstract
Large Language Models (LLMs) are widely deployed in diverse real-world settings, yet remain vulnerable to jailbreaking, where prompt-based attacks bypass safety filters. We present THREAT (Targeted Harmful generation via Reframing and Exploitation of Adversarial Tactics), a reasoning-driven framework that coordinates multiple LLMs in an iterative search loop to find textual jailbreak prompts. We formulate prompt discovery as a nonconvex optimization problem and provide an efficient solution that lowers runtime and improves attack effectiveness. Across diverse datasets and model architectures, THREAT delivers higher attack success rates with lower computational cost than prior methods. The crafted prompts were flagged as harmful in fewer than 1% of cases, compared with about 50% refusals for the corresponding unmodified prompts. These findings reveal previously undetected vulnerabilities in aligned LLMs and position THREAT as a practical tool for proactively strengthening the safety of foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces THREAT, a framework for targeted harmful generation in language models. It coordinates multiple LLMs in an iterative search loop to discover jailbreak prompts, formulated as a nonconvex optimization problem. The authors claim higher attack success rates and lower computational cost than prior methods across diverse datasets and model architectures, with crafted prompts flagged as harmful in fewer than 1% of cases versus about 50% refusals for unmodified prompts.
Significance. If the empirical claims hold after detailed verification, the work would be significant for AI safety research by exposing vulnerabilities in aligned LLMs and offering a practical red-teaming tool. The multi-LLM iterative coordination and nonconvex optimization framing represent a novel approach to prompt discovery. The reported stealth property (<1% flagging) is a notable strength if reproducible.
major comments (2)
- Abstract: The abstract states performance numbers and a nonconvex optimization framing but supplies no concrete algorithm steps, baseline comparisons, statistical tests, or error analysis, so the data cannot be checked for support of the central claims of higher ASR and lower cost.
- Method section (iterative coordination loop): The central cross-architecture generalization claim depends on the iterative multi-LLM loop producing prompts without undisclosed model-specific tuning of free parameters such as iteration count and coordination schedule; if effectiveness arises from implicit adaptation to target refusal patterns rather than the nonconvex solver, the reported gains and stealth would not transfer, undermining the claim.
minor comments (1)
- Abstract: Define 'attack success rate' and 'computational cost' more precisely to allow direct comparison with baselines.
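A precise definition could take the following shape. This is a minimal sketch under stated assumptions, not the authors' metric: it treats attack success rate as the fraction of attempts that an external judge marks as successful, and it reports a Wilson score interval so that rates from different baselines can be compared together with their uncertainty. The names `attack_success_rate` and `wilson_interval` are hypothetical.

```python
import math


def attack_success_rate(successes, trials):
    """Fraction of attempts judged successful (the judge itself is assumed external)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return successes / trials


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p_hat = attack_success_rate(successes, trials)
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

Reporting the interval alongside the point estimate makes it possible to tell whether a gap between two methods exceeds what sampling noise alone would produce.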
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below in a point-by-point manner, indicating planned revisions where appropriate to improve clarity and verifiability.
- Referee: Abstract: The abstract states performance numbers and a nonconvex optimization framing but supplies no concrete algorithm steps, baseline comparisons, statistical tests, or error analysis, so the data cannot be checked for support of the central claims of higher ASR and lower cost.
  Authors: We acknowledge that the abstract's brevity limits the inclusion of full details. In the revised manuscript, we will expand the abstract to include a concise outline of the iterative multi-LLM coordination steps and explicitly name the primary baselines and datasets. Full baseline comparisons, statistical tests (including means, standard deviations, and significance levels), and error analysis are presented in Sections 4 and 5 with supporting tables; we will add a cross-reference in the abstract to direct readers to these sections for verification of the ASR and cost claims.
  Revision: yes
- Referee: Method section (iterative coordination loop): The central cross-architecture generalization claim depends on the iterative multi-LLM loop producing prompts without undisclosed model-specific tuning of free parameters such as iteration count and coordination schedule; if effectiveness arises from implicit adaptation to target refusal patterns rather than the nonconvex solver, the reported gains and stealth would not transfer, undermining the claim.
  Authors: Section 3.2 of the manuscript states that hyperparameters including iteration count and coordination schedule are fixed values determined on a held-out validation set and applied uniformly across all target models and architectures with no per-model tuning. This fixed-parameter design supports the reported cross-architecture generalization. To further address the concern, we will add an ablation study in the revision showing performance stability across a range of iteration counts and schedules, confirming that gains derive from the nonconvex optimization and multi-LLM coordination rather than implicit adaptation. We maintain that no undisclosed model-specific tuning occurs.
  Revision: partial
Circularity Check
No circularity in claimed derivation or results
The abstract and available text present THREAT as an empirical framework that formulates prompt search as nonconvex optimization and reports measured attack success rates, runtime, and detection rates across datasets and models. No equations, fitted parameters, or self-citation chains are exhibited that would reduce the reported performance metrics to inputs by construction. The central claims rest on experimental comparisons rather than any definitional equivalence or renamed known result. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- iteration count and coordination schedule
axioms (1)
- domain assumption: Multiple LLMs can be prompted to improve each other's jailbreak attempts without the process being blocked by their own safety filters
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We formulate prompt discovery as a nonconvex optimization problem... maximize $R(x_{i-1}, x_{i-1} + \delta_i)$ subject to $\varepsilon_1 < S(x_{i-1}, x_i) < \varepsilon_2$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: THREAT coordinates multiple LLMs in an iterative search loop
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.