pith. machine review for the scientific record. sign in

arxiv: 2310.02446 · v2 · pith:2C5FQKYPnew · submitted 2023-10-03 · 💻 cs.CL · cs.AI· cs.CR· cs.LG

Low-Resource Languages Jailbreak GPT-4

Pith reviewed 2026-05-17 09:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CRcs.LG
keywords jailbreakinglow-resource languagesGPT-4AI safetymultilingualadversarial promptsLLM vulnerabilitiesred-teaming
0
0 comments X

The pith

Translating harmful English prompts into low-resource languages lets GPT-4 provide actionable advice for bad goals 79 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that GPT-4's safety training leaves gaps for low-resource languages because those languages appear far less often in the training data. Translating unsafe English prompts into such languages causes the model to engage and give concrete steps toward harmful goals 79 percent of the time on the AdvBenchmark. This rate matches or exceeds what specialized jailbreak attacks achieve. The same prompts translated into high- or mid-resource languages trigger refusals much more often. Because free translation tools are widely available, the weakness now threatens any user of the model rather than only people who speak low-resource languages.

Core claim

Translating unsafe English inputs into low-resource languages circumvents GPT-4's safeguards, leading the model to engage with and provide actionable items for harmful goals 79 percent of the time on the AdvBenchmark. This success rate is on par with or surpasses state-of-the-art jailbreaking attacks. High- and mid-resource languages show significantly lower attack success rates, indicating that the cross-lingual vulnerability stems mainly from limited safety training data in low-resource languages. The authors note that this deficiency, previously affecting only speakers of those languages, now creates risks for all LLM users through publicly available translation APIs.

What carries the argument

Translation of unsafe English prompts into low-resource languages to exploit imbalances in safety training data across languages.

If this is right

  • Low-resource language translations reach 79 percent attack success, matching or beating dedicated jailbreaks.
  • High- and mid-resource languages resist translated unsafe inputs at much higher rates.
  • Public translation APIs let anyone exploit the safety gap without special tools or skills.
  • Safety training must expand to cover a wide range of languages to close the vulnerability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same translation-based weakness is likely present in other large language models trained on similarly unbalanced data.
  • Red-teaming protocols should routinely test prompts after automatic translation across many languages.
  • Safety training that balances data across more languages could reduce this attack surface for future models.

Load-bearing premise

That translating an unsafe prompt into a low-resource language keeps its original harmful intent and does not trigger extra safety refusals that would appear in English.

What would settle it

Running the same harmful requests written originally in low-resource languages rather than translated from English and checking whether the compliance rate stays near 79 percent.

read the original abstract

AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affects speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLMs users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for a more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that translating unsafe English prompts into low-resource languages bypasses GPT-4's safety mechanisms due to imbalanced safety training data, achieving a 79% attack success rate (ASR) on the AdvBenchmark that matches or exceeds state-of-the-art jailbreaks, while high- and mid-resource languages yield significantly lower ASRs; this exposes a cross-lingual vulnerability accessible via public translation APIs and calls for broader multilingual red-teaming.

Significance. If the central empirical result holds after addressing methodological gaps, the work is significant for highlighting a practical, low-barrier attack vector that affects all LLM users rather than only low-resource speakers, and for providing falsifiable evidence that current safety alignments are language-dependent; it strengthens the case for holistic multilingual safeguards and reproducible benchmarks in AI safety.

major comments (3)
  1. [Abstract and Experimental Results] The abstract and results section report a 79% ASR on AdvBenchmark without specifying sample size (number of prompts or languages tested), selection criteria for low-resource languages, or any statistical tests (e.g., confidence intervals or significance against baselines), which are load-bearing for the claim that this rate demonstrates a true safety bypass rather than an artifact of the experimental setup.
  2. [Methodology] The methodology does not include translation quality metrics, back-translation checks, or human fidelity ratings for the machine-translated prompts; without these, it is impossible to confirm that the unsafe intent is preserved exactly, undermining the attribution of the 79% ASR to linguistic inequality in safety training rather than semantic drift or neutralization in low-resource translations.
  3. [Results and Discussion] The comparison to state-of-the-art jailbreaking attacks lacks a direct side-by-side table or citation of the exact ASR values and prompt sets used for those baselines, making the claim that the low-resource translation approach is 'on par with or even surpassing' them difficult to evaluate rigorously.
minor comments (2)
  1. [Abstract] The abstract states that high-/mid-resource languages have 'significantly lower' ASR but provides no numerical values or reference to a table/figure showing these rates for direct comparison.
  2. [Introduction] Notation for attack success rate (ASR) and AdvBenchmark should be defined on first use with a brief description of the benchmark's composition.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity, reproducibility, and rigor of our manuscript. We address each major comment point by point below and commit to revisions that strengthen the presentation of our results without altering the core findings.

read point-by-point responses
  1. Referee: [Abstract and Experimental Results] The abstract and results section report a 79% ASR on AdvBenchmark without specifying sample size (number of prompts or languages tested), selection criteria for low-resource languages, or any statistical tests (e.g., confidence intervals or significance against baselines), which are load-bearing for the claim that this rate demonstrates a true safety bypass rather than an artifact of the experimental setup.

    Authors: We agree that these details are essential for rigorous evaluation and reproducibility. In the revised manuscript, we will explicitly report the sample size (number of prompts tested from AdvBenchmark), the criteria used to select the low-resource languages (based on established resource-level classifications from prior NLP literature), and include statistical measures such as confidence intervals around the 79% ASR along with significance tests relative to baselines. These additions will appear in both the abstract and the experimental results section. revision: yes

  2. Referee: [Methodology] The methodology does not include translation quality metrics, back-translation checks, or human fidelity ratings for the machine-translated prompts; without these, it is impossible to confirm that the unsafe intent is preserved exactly, undermining the attribution of the 79% ASR to linguistic inequality in safety training rather than semantic drift or neutralization in low-resource translations.

    Authors: We acknowledge that explicit verification of translation fidelity would strengthen the causal attribution to safety training imbalances. Although the experiments used publicly available translation APIs that generally preserve semantic intent, the original submission did not report quality checks. We will revise the methodology section to incorporate back-translation verification on a sampled subset of prompts, along with automated metrics (e.g., BLEU) and, where feasible, human fidelity ratings to confirm that unsafe intent remains intact. revision: yes

  3. Referee: [Results and Discussion] The comparison to state-of-the-art jailbreaking attacks lacks a direct side-by-side table or citation of the exact ASR values and prompt sets used for those baselines, making the claim that the low-resource translation approach is 'on par with or even surpassing' them difficult to evaluate rigorously.

    Authors: We will add a dedicated comparison table in the results section that directly lists our ASR alongside the exact values reported in the cited state-of-the-art jailbreaking papers, including the specific benchmarks or prompt sets employed in those works. The table will also note any differences in evaluation conditions to enable a more precise and transparent assessment of relative performance. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurement

full rationale

The paper reports an empirical attack success rate (79% on AdvBenchmark) obtained by translating English unsafe prompts via public APIs and querying GPT-4. No equations, fitted parameters, predictions, or first-principles derivations are present in the provided text. The central claim is a direct observation from model responses rather than a result that reduces to its own inputs by construction, self-citation, or renaming. The analysis is self-contained against external benchmarks (model queries) and does not rely on load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on experimental observations of model behavior under translated prompts; no free parameters, mathematical axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5502 in / 1057 out tokens · 71364 ms · 2026-05-17T09:12:43.135443+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

    cs.LG 2026-05 unverdicted novelty 8.0

    A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

  2. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  3. SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

    cs.CR 2026-04 unverdicted novelty 7.0

    SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...

  4. Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

    cs.CR 2026-05 unverdicted novelty 6.0

    A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

  5. One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

    cs.CL 2026-05 unverdicted novelty 6.0

    TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.

  6. One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

    cs.CL 2026-05 unverdicted novelty 6.0

    TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

  7. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    cs.AI 2026-05 unverdicted novelty 6.0

    An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...

  8. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  9. A Theoretical Game of Attacks via Compositional Skills

    cs.CL 2026-05 unverdicted novelty 6.0

    A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stro...

  10. Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation

    cs.CR 2026-04 unverdicted novelty 6.0

    Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.

  11. COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

  12. TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

  13. LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

    cs.LG 2026-04 unverdicted novelty 6.0

    LASA improves LLM safety by aligning at language-agnostic semantic bottlenecks, reducing average ASR from 24.7% to 2.8% on LLaMA-3.1-8B and to 3-4% on Qwen models.

  14. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    cs.LG 2024-10 accept novelty 6.0

    AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.

  15. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  16. A StrongREJECT for Empty Jailbreaks

    cs.LG 2024-02 conditional novelty 6.0

    StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

  17. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  18. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

  19. ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs

    cs.CR 2025-11 unverdicted novelty 5.0

    ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.

  20. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 18 Pith papers · 13 internal anchors

  1. [1]

    Jigsaw multilingual toxic comment classification, 2020

    Jigsaw/Conversation AI. Jigsaw multilingual toxic comment classification, 2020. https://www.kaggle.com/competitions/ jigsaw-multilingual-toxic-comment-classification , Last accessed on 2023-09-14

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  3. [3]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  4. [4]

    A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Love- nia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023

  5. [5]

    Building machine translation systems for the next thousand languages

    Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, et al. Building machine translation systems for the next thousand languages. arXiv preprint arXiv:2205.03983, 2022

  6. [6]

    Seamlessm4t- massively multilingual & multimodal machine translation

    Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t- massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596, 2023

  7. [7]

    Systematic inequalities in language technology performance across the world’s languages

    Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. Systematic inequalities in language technology performance across the world’s languages. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10....

  8. [8]

    Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023

    Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023. 7

  9. [9]

    Aim, 2023

    Jailbreak Chat. Aim, 2023. https://www.jailbreakchat.com/prompt/ 4f37a029-9dff-4862-b323-c96a5504de5d , Last accessed on 2023-09-13

  10. [10]

    Translatorbot, 2023

    Jailbreak Chat. Translatorbot, 2023. https://www.jailbreakchat.com/prompt/ 3e93895c-2542-4201-a297-aa8be2db8bd7 , Last accessed on 2023-09-11

  11. [11]

    How is chatgpt’ s behavior changing over time?, 2023

    Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’ s behavior changing over time?, 2023

  12. [12]

    CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech

    Yi-Ling Chung, Elizaveta Kuzmenko, Serra Sinem Tekiroglu, and Marco Guerini. CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech. In Proceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics, pages 2819–2829, Florence, Italy, July 2019. Association for Computatio...

  13. [13]

    Language support, 2023

    Google Cloud. Language support, 2023. https://cloud.google.com/translate/docs/ languages, Last accessed on 2023-09-14

  14. [14]

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022

  15. [15]

    Multilingual Jailbreak Challenges in Large Language Models

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. ArXiv, abs/2310.06474, 2023. URL https://api. semanticscholar.org/CorpusID:263831094

  16. [16]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022

  17. [17]

    Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamil˙e Lukoši¯ut˙e, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noem...

  18. [18]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Real- toxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020

  19. [19]

    Chatgpt perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across bengali and five other low-resource languages

    Sourojit Ghosh and Aylin Caliskan. Chatgpt perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across bengali and five other low-resource languages. arXiv preprint arXiv:2305.10510, 2023

  20. [20]

    How good are gpt models at machine translation? a comprehensive evaluation.arXiv preprint arXiv:2302.09210, 2023

    Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. How good are gpt models at machine translation? a comprehensive evaluation.arXiv preprint arXiv:2302.09210, 2023

  21. [21]

    Cbbq: A chinese bias benchmark dataset curated with human-ai collaboration for large language models

    Yufei Huang and Deyi Xiong. Cbbq: A chinese bias benchmark dataset curated with human-ai collaboration for large language models. arXiv preprint arXiv:2306.16244, 2023

  22. [22]

    Adversarial Examples for Evaluating Reading Comprehension Systems

    Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, September 2017. Association for Com- putational Linguistics. doi: 10.18653/v1/D17-1215. URL https://aclanthology.org/ D17-1215. 8

  23. [23]

    Automatically auditing large language models via discrete optimization

    Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. In Proceedings of the 40th International Conference on Machine Learning, ICML ’23. JMLR.org, 2023

  24. [24]

    doi: 10.18653/v1/ 2024.acl-long.702

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , 2020. doi: 10.18653/v1/ 2020.acl-main.560

  25. [25]

    Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning.arXiv preprint arXiv:2304.05613, 2023

    Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning.arXiv preprint arXiv:2304.05613, 2023

  26. [26]

    Open sesame! universal black box jailbreaking of large language models

    Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023

  27. [27]

    Rain: Your language models can align themselves without finetuning

    Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. Rain: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124, 2023

  28. [30]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023

  29. [31]

    Black box adversarial prompting for foundation models

    Natalie Maus, Patrick Chao, Eric Wong, and Jacob R Gardner. Black box adversarial prompting for foundation models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023

  30. [32]

    Mitigating harm in language models with conditional-likelihood filtration

    Helen Ngo, Cooper Raterink, João GM Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, and Nicholas Frosst. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790, 2021

  31. [33]

    Duolingo, 2023

    OpenAI. Duolingo, 2023. https://openai.com/customer-stories/duolingo, Last ac- cessed on 2023-09-14

  32. [34]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  33. [35]

    Government of iceland, 2023

    OpenAI. Government of iceland, 2023. https://openai.com/customer-stories/ government-of-iceland, Last accessed on 2023-09-14

  34. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  35. [37]

    Practical black-box attacks against machine learning

    Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Anan- thram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security , pages 506–519, 2017. 9

  36. [38]

    Red teaming language models with language models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates, December 2022. Association for Com...

  37. [39]

    Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold

    Sebastian Ruder, Ivan Vuli´ c, and Anders Søgaard. Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2340–2354, Dublin, Ireland, May 2022. As- sociation for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.184. URL https://acl...

  38. [40]

    Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models

    Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, et al. Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149, 2023

  39. [41]

    "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023

  40. [42]

    Why so toxic? measuring and triggering toxic behavior in open-domain chatbots

    Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, and Yang Zhang. Why so toxic? measuring and triggering toxic behavior in open-domain chatbots. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 2659–2673, 2022

  41. [43]

    Universal adversarial attacks with natural triggers for text classification

    Liwei Song, Xinwei Yu, Hsuan-Tung Peng, and Karthik Narasimhan. Universal adversarial attacks with natural triggers for text classification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3724–3733, Online, June 2021. Association for Computational Lin...

  42. [44]

    Smith, and Luke Zettlemoyer

    Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy, July 2019. Association for Computational Lin- guistics. doi: 10.18653/v1/P19-1164. URL https://aclanthology.org/P19-1164

  43. [45]

    ChatGPT is not a good indigenous translator

    David Stap and Ali Araabi. ChatGPT is not a good indigenous translator. In Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), pages 163–167, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.americasnlp-1.17. URL https://aclanthology.org/ 2023.americ...

  44. [46]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  45. [47]

    Translated unleashes full gpt-4 potential for businesses operating in languages other than english, 2023

    Translated. Translated unleashes full gpt-4 potential for businesses operating in languages other than english, 2023. https://translated.com/t-lm-gpt-integration , Last ac- cessed on 2023-09-14

  46. [48]

    Universal adversarial triggers for attacking and analyzing NLP

    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 2153–2162, Hong Kong, China, November

  47. [49]

    doi: 10.18653/v1/D19-1221

    Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https: //aclanthology.org/D19-1221

  48. [50]

    Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen tse Huang, Wenxiang Jiao, and Michael R. Lyu. All languages matter: On the multilingual safety of large language mod- els. ArXiv, abs/2310.00905, 2023. URL https://api.semanticscholar.org/CorpusID: 263605466. 10

  49. [51]

    Do-not-answer: A dataset for evaluating safeguards in llms

    Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023

  50. [52]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  51. [53]

    Ethical and social risks of harm from Language Models

    Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021

  52. [54]

    Prompting large language models to generate code-mixed texts: The case of south east asian languages

    Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Samuel Cahyawijaya, Holy Lovenia, Lintang Sutawika, Jan Christian Blaise Cruz, Long Phan, Yin Lin Tan, et al. Prompting large language models to generate code-mixed texts: The case of south east asian languages. arXiv preprint arXiv:2303.13592, 2023

  53. [55]

    Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher

    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023

  54. [56]

    Exploring ai ethics of chatgpt: A diagnostic analysis

    Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867, pages 12–2, 2023

  55. [57]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adver- sarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 11 A Language resource settings classification We classify the resource setting of a language using the taxonomy provided by Joshi et al. [24]. • Low-Resource: Languages that are cons...