arxiv: 2605.18239 · v1 · pith:F33LUR3Znew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Multilingual jailbreaking of LLMs using low-resource languages

Pith reviewed 2026-05-20 10:26 UTC · model grok-4.3

keywords jailbreaking · LLMs · low-resource languages · multilingual safety · red-teaming · African languages · safety guardrails · translation quality

The pith

Multi-turn conversations jailbreak commercial LLMs at harmful response rates of up to 83.6% in English, 78.2% in Afrikaans, and 70.9% in Kiswahili.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether translating English jailbreak prompts into low-resource languages and using multi-turn dialogues can bypass safety filters in models such as GPT-4o-mini, Claude 3.5 Haiku, and others. Single-turn translations yield low success, but extended conversations raise harmful response rates to between 52.7% and 83.6% in English, 60.0% and 78.2% in Afrikaans, and 41.8% and 70.9% in Kiswahili, depending on the model. Human red-teaming by native speakers further lifts the average success rate from 59.8% to 75.8%, with the largest gains in Afrikaans and isiZulu. Translation quality is identified as the main limiter of attack effectiveness.

Core claim

Large Language Models remain vulnerable to jailbreak attempts that circumvent safety guardrails when prompts are delivered through multi-turn conversations in low-resource languages. With prompts from existing datasets translated into Afrikaans, Kiswahili, isiXhosa, and isiZulu, multi-turn tests on commercial LLMs produce harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini) in English, from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini) in Afrikaans, and from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek) in Kiswahili; human red-teaming raises the overall average from 59.8% to 75.8%.

What carries the argument

Multi-turn conversation structure in low-resource languages, where iterative prompting after initial translation allows gradual escalation that single-turn attacks cannot achieve.

If this is right

  • Safety mechanisms that resist single-turn English prompts become less reliable once the interaction stretches across multiple turns in another language.
  • Human red-teaming raises average jailbreak rates by roughly 16 percentage points across the tested languages.
  • Translation quality limits how often the automated attack succeeds, so better machine translation would be expected to raise attack success rather than reduce the vulnerability.
  • Vulnerabilities persist across commercial models even when the user language is not English.
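
The "roughly 16 percentage points" figure in the list above is an absolute difference, which is easy to confuse with a relative increase. The short sketch below works both out from the two averages quoted in the abstract (59.8% and 75.8%). The function names are illustrative and not taken from the paper.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change between two rates that are both given in percent."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting rate."""
    return (after_pct - before_pct) / before_pct * 100.0


# Averages quoted in the abstract: automated testing vs. human red-teaming.
automated_avg = 59.8
human_avg = 75.8

pp_gain = round(percentage_point_gain(automated_avg, human_avg), 1)
rel_gain = round(relative_gain(automated_avg, human_avg), 1)
print(f"{pp_gain} percentage points, {rel_gain}% relative")
```

The abstract writes the per-language gains with a plain "%" sign (for example "+20.0%"); reading them as percentage points is an interpretation, not something the abstract states explicitly.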

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety training for LLMs may need explicit exposure to low-resource language patterns to close these gaps.
  • Detection systems could require separate calibration for each language family to avoid under- or over-flagging content.
  • The same multi-turn technique might transfer to other low-resource languages outside the four African ones studied here.

Load-bearing premise

That automated detection of harmful responses stays consistent and unbiased when applied to text in low-resource languages.

What would settle it

Compare jailbreak success rates for the identical multi-turn strategy when using professional high-quality translations versus the automated translations used in the study.

Figures

Figures reproduced from arXiv: 2605.18239 by Dylan Marx, Marcel Dunaiski.

Figure 1: Harmful response rates per model and language using the single-turn MultiJail dataset.
Figure 2: Harmful response rates per model and language using the conversation subset of the multi-turn MHJ dataset.
Figure 3: Distribution of BERTScores by language for translated multi-turn prompts.
Figure 4: Distribution of BLEU score by language for translated multi-turn prompts.
Figure 5: Distribution of METEOR score by language for translated multi-turn prompts.
Original abstract

Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), Afrikaans from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini), and Kiswahili from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek). Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, isiZulu) can jailbreak commercial LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok) more effectively than single-turn translations. Automated testing yields English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), with comparable rates in low-resource languages; human red-teaming by native speakers raises the average from 59.8% to 75.8%, and the authors conclude that translation quality is the critical factor determining success.

Significance. If the empirical measurements hold, the work provides concrete evidence of multilingual safety gaps in current LLMs, particularly for low-resource languages. It could motivate improved cross-lingual alignment techniques and more rigorous adversarial testing protocols that account for translation artifacts and native-speaker interactions.

major comments (2)
  1. [Methods] Methods section: the automated harmful-response classifier is not described in sufficient detail (prompt template, base model, or validation set). Without evidence that precision/recall is consistent across English, Afrikaans, Kiswahili, isiXhosa, and isiZulu (especially with code-switching or non-standard orthography), the cross-lingual rate comparisons and the claim that translation quality is the critical factor are difficult to interpret.
  2. [Results] Results and Abstract: the number of prompts, dataset size, and any statistical significance tests for the reported differences (e.g., +20.0% Afrikaans improvement from human red-teaming) are not stated. This leaves the concrete percentages (52.7%–83.6%) without clear error bars or power analysis, weakening confidence in the central empirical claims.
minor comments (2)
  1. [Abstract] Abstract: specify the exact model versions (e.g., GPT-4o-mini checkpoint) and the translation pipeline (machine translation model, post-editing steps) to allow replication.
  2. [Evaluation] Clarify whether the same harm-classification rubric was applied uniformly to both automated and human-red-teamed outputs.
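
The first major comment asks for evidence that classifier precision and recall are consistent across languages. A minimal sketch of such a check follows, assuming a validation set in which each response carries a human harm label and the automated classifier's label. The record layout, the function name, and the toy data are all hypothetical; nothing here reflects the paper's actual classifier or validation data.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (language, human_label, classifier_label); True = harmful.
Record = Tuple[str, bool, bool]


def per_language_precision_recall(
    records: Iterable[Record],
) -> Dict[str, Dict[str, float]]:
    """Compute the classifier's precision and recall separately per language."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for lang, human, predicted in records:
        c = counts[lang]  # registers the language even if it only has true negatives
        if predicted and human:
            c["tp"] += 1
        elif predicted and not human:
            c["fp"] += 1
        elif human and not predicted:
            c["fn"] += 1

    metrics = {}
    for lang, c in counts.items():
        flagged = c["tp"] + c["fp"]
        harmful = c["tp"] + c["fn"]
        metrics[lang] = {
            # NaN marks an undefined ratio (nothing flagged / nothing harmful).
            "precision": c["tp"] / flagged if flagged else math.nan,
            "recall": c["tp"] / harmful if harmful else math.nan,
        }
    return metrics


# Toy data for illustration only.
records = [
    ("English", True, True), ("English", True, True),
    ("English", False, False), ("English", False, True),
    ("isiZulu", True, True), ("isiZulu", True, False),
    ("isiZulu", True, False), ("isiZulu", False, False),
]
metrics = per_language_precision_recall(records)
for lang, m in metrics.items():
    print(lang, f"precision={m['precision']:.2f}", f"recall={m['recall']:.2f}")
```

In the toy data the classifier misses two of three harmful isiZulu responses, so its recall there is much lower than in English. A gap of that kind would make the measured harmful response rate in isiZulu an underestimate, which is the confound the referee is pointing at.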

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for clarification in our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version.

  1. Referee: [Methods] Methods section: the automated harmful-response classifier is not described in sufficient detail (prompt template, base model, or validation set). Without evidence that precision/recall is consistent across English, Afrikaans, Kiswahili, isiXhosa, and isiZulu (especially with code-switching or non-standard orthography), the cross-lingual rate comparisons and the claim that translation quality is the critical factor are difficult to interpret.

    Authors: We agree that the Methods section requires substantially more detail on the automated classifier to support the cross-lingual comparisons and the attribution of performance differences to translation quality. In the revised manuscript we will add the full prompt template, the base model used for classification, the size and composition of the validation set, and any available precision/recall figures. We will also explicitly discuss limitations related to code-switching and non-standard orthography and note where language-specific validation was not performed. These additions will make the reliability of the automated rates transparent. revision: yes

  2. Referee: [Results] Results and Abstract: the number of prompts, dataset size, and any statistical significance tests for the reported differences (e.g., +20.0% Afrikaans improvement from human red-teaming) are not stated. This leaves the concrete percentages (52.7%–83.6%) without clear error bars or power analysis, weakening confidence in the central empirical claims.

    Authors: We acknowledge that the manuscript currently omits the exact number of prompts, dataset sizes, and statistical tests. In the revised version we will state the precise number of prompts and source dataset sizes for each experiment, add statistical significance tests (e.g., McNemar or chi-square tests) for the reported differences including the human red-teaming gains, and include error bars or confidence intervals on the key percentages. These changes will strengthen the presentation of the empirical results without altering the central findings. revision: yes
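
The confidence intervals and McNemar test mentioned in the second response could be computed along the following lines. This is a sketch using only the Python standard library: a Wilson score interval for a single harmful response rate, and an exact two-sided McNemar test for paired outcomes on the same prompts. All counts are hypothetical. In particular, 46 out of 55 is used only because it rounds to 83.6%; the summary above does not state the paper's sample size.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: prompts judged harmful only under condition A (e.g. automated).
    c: prompts judged harmful only under condition B (e.g. human red-teaming).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
low, high = wilson_interval(successes=46, n=55)
p_value = mcnemar_exact(b=2, c=13)
print(f"95% CI: [{low:.3f}, {high:.3f}]  McNemar p = {p_value:.4f}")
```

McNemar's test fits when the same prompts are evaluated under both conditions, because it uses only the pairs whose outcome changed. If the automated and human red-teaming runs used different prompt sets, an unpaired test such as chi-square would be the appropriate choice instead.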

Circularity Check

0 steps flagged

No circularity: direct empirical measurement study

This paper reports experimental results from translating English jailbreak prompts into low-resource African languages, querying commercial LLMs, and measuring harmful response rates via automated classifiers plus human red-teaming. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described methodology. Results are obtained from external model interactions and native-speaker sessions rather than any self-referential construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The study is therefore self-contained against external benchmarks with no reduction of claims to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper; no mathematical free parameters, axioms, or invented entities. Claims rest on experimental protocol, prompt translation, and harm labeling.


