{"paper":{"title":"Low-Resource Languages Jailbreak GPT-4","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Translating harmful English prompts into low-resource languages lets GPT-4 provide actionable advice for bad goals 79 percent of the time.","cross_cats":["cs.AI","cs.CR","cs.LG"],"primary_cat":"cs.CL","authors_text":"Cristina Menghini, Stephen H. Bach, Zheng-Xin Yong","submitted_at":"2023-10-03T21:30:56Z","abstract_excerpt":"AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpass"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That automatic translations preserve the original unsafe intent without introducing detectable artifacts or triggering the model's cross-lingual safety mechanisms, and that results on the tested benchmark generalize to other prompts and models.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Translating harmful English prompts into low-resource languages lets GPT-4 provide actionable advice for bad goals 79 percent of the time.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5acc8099a82390bf1e378b69ad7a2679c90d48e86db7de89c166a8b17df892d1"},"source":{"id":"2310.02446","kind":"arxiv","version":2},"verdict":{"id":"fe7316e4-a497-4337-b3d2-1f0fc27eff95","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T09:12:43.135443Z","strongest_claim":"On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks.","one_line_summary":"Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That automatic translations preserve the original unsafe intent without introducing detectable artifacts or triggering the model's cross-lingual safety mechanisms, and that results on the tested benchmark generalize to other prompts and models.","pith_extraction_headline":"Translating harmful English prompts into low-resource languages lets GPT-4 provide actionable advice for bad goals 79 percent of the time."},"references":{"count":55,"sample":[{"doi":"","year":2020,"title":"Jigsaw multilingual toxic comment classification, 2020","work_id":"acd2de50-2c59-482c-82e8-153db57bc84d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","ref_index":2,"cited_arxiv_id":"2204.05862","is_internal_anchor":true},{"doi":"","year":2022,"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","ref_index":3,"cited_arxiv_id":"2212.08073","is_internal_anchor":true},{"doi":"","year":2023,"title":"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity","work_id":"41dff74c-00b2-4c77-a674-9f86030c06c8","ref_index":4,"cited_arxiv_id":"2302.04023","is_internal_anchor":true},{"doi":"","year":2022,"title":"X., Macherey, K., Krikun, M., Wang, P., Gutkin, A., Shah, A., Huang, Y., Chen, Z., Wu, Y., and Hughes, M","work_id":"c55757ac-fd65-4628-8d13-7ce5b6c20ff0","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":55,"snapshot_sha256":"7c923c5d51f1b6baab3a00891cd1dffa7a58bf223887463a1b27ff32553e91a3","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"685cae991da7393e3601bb01782359bffbed4eef16ae8255aba754e370d4704d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}