arxiv: 2310.02446 · v2 · pith:2C5FQKYPnew · submitted 2023-10-03 · 💻 cs.CL · cs.AI· cs.CR· cs.LG

Low-Resource Languages Jailbreak GPT-4

Zheng-Xin Yong , Cristina Menghini , Stephen H. Bach This is my paper

Pith reviewed 2026-05-17 09:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CRcs.LG

keywords jailbreakinglow-resource languagesGPT-4AI safetymultilingualadversarial promptsLLM vulnerabilitiesred-teaming

0 comments

The pith

Translating harmful English prompts into low-resource languages lets GPT-4 provide actionable advice for bad goals 79 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that GPT-4's safety training leaves gaps for low-resource languages because those languages appear far less often in the training data. Translating unsafe English prompts into such languages causes the model to engage and give concrete steps toward harmful goals 79 percent of the time on the AdvBenchmark. This rate matches or exceeds what specialized jailbreak attacks achieve. The same prompts translated into high- or mid-resource languages trigger refusals much more often. Because free translation tools are widely available, the weakness now threatens any user of the model rather than only people who speak low-resource languages.

Core claim

Translating unsafe English inputs into low-resource languages circumvents GPT-4's safeguards, leading the model to engage with and provide actionable items for harmful goals 79 percent of the time on the AdvBenchmark. This success rate is on par with or surpasses state-of-the-art jailbreaking attacks. High- and mid-resource languages show significantly lower attack success rates, indicating that the cross-lingual vulnerability stems mainly from limited safety training data in low-resource languages. The authors note that this deficiency, previously affecting only speakers of those languages, now creates risks for all LLM users through publicly available translation APIs.

What carries the argument

Translation of unsafe English prompts into low-resource languages to exploit imbalances in safety training data across languages.

If this is right

Low-resource language translations reach 79 percent attack success, matching or beating dedicated jailbreaks.
High- and mid-resource languages resist translated unsafe inputs at much higher rates.
Public translation APIs let anyone exploit the safety gap without special tools or skills.
Safety training must expand to cover a wide range of languages to close the vulnerability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same translation-based weakness is likely present in other large language models trained on similarly unbalanced data.
Red-teaming protocols should routinely test prompts after automatic translation across many languages.
Safety training that balances data across more languages could reduce this attack surface for future models.

Load-bearing premise

That translating an unsafe prompt into a low-resource language keeps its original harmful intent and does not trigger extra safety refusals that would appear in English.

What would settle it

Running the same harmful requests written originally in low-resource languages rather than translated from English and checking whether the compliance rate stays near 79 percent.

read the original abstract

AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affects speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLMs users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for a more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Translating unsafe prompts into low-resource languages bypasses GPT-4 safety at 79% success, but without translation fidelity checks the result could reflect prompt degradation instead.

read the letter

The key point from this paper is that low-resource language translations serve as an effective jailbreak for GPT-4, with a reported 79% attack success rate on the AdvBenchmark that holds up against or exceeds some existing methods. What is new is the direct link between the data imbalance in safety training for low-resource languages and the ability to elicit harmful responses. The authors show that high- and mid-resource languages do not work nearly as well for this, which points to training data coverage as the main driver rather than general model weaknesses. They also emphasize that public translation tools make this accessible to anyone, turning a niche issue into a broad safety concern. The experiment itself is clean in concept. They start with English unsafe prompts, translate them, query the model, and measure engagement with actionable harmful content. This setup is easy to replicate and focuses on a real deployment risk. The soft spots are in the missing details around the translations. Without reported checks like back-translation accuracy or human evaluation of intent preservation, it's possible that the higher success rate comes from degraded prompt quality rather than a true failure of cross-lingual safety. The abstract and available info do not address statistical significance or exact prompt counts either, which makes the 79% figure harder to interpret precisely. This work is for researchers in AI safety who are thinking about multilingual robustness and red-teaming strategies. Anyone evaluating LLM safeguards for broad language support would find the results relevant and the call for wider coverage useful. I would recommend sending it to peer review. The finding identifies a practical vulnerability that needs attention, and referees can help tighten the methodology around translation fidelity and evaluation criteria.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that translating unsafe English prompts into low-resource languages bypasses GPT-4's safety mechanisms due to imbalanced safety training data, achieving a 79% attack success rate (ASR) on the AdvBenchmark that matches or exceeds state-of-the-art jailbreaks, while high- and mid-resource languages yield significantly lower ASRs; this exposes a cross-lingual vulnerability accessible via public translation APIs and calls for broader multilingual red-teaming.

Significance. If the central empirical result holds after addressing methodological gaps, the work is significant for highlighting a practical, low-barrier attack vector that affects all LLM users rather than only low-resource speakers, and for providing falsifiable evidence that current safety alignments are language-dependent; it strengthens the case for holistic multilingual safeguards and reproducible benchmarks in AI safety.

major comments (3)

[Abstract and Experimental Results] The abstract and results section report a 79% ASR on AdvBenchmark without specifying sample size (number of prompts or languages tested), selection criteria for low-resource languages, or any statistical tests (e.g., confidence intervals or significance against baselines), which are load-bearing for the claim that this rate demonstrates a true safety bypass rather than an artifact of the experimental setup.
[Methodology] The methodology does not include translation quality metrics, back-translation checks, or human fidelity ratings for the machine-translated prompts; without these, it is impossible to confirm that the unsafe intent is preserved exactly, undermining the attribution of the 79% ASR to linguistic inequality in safety training rather than semantic drift or neutralization in low-resource translations.
[Results and Discussion] The comparison to state-of-the-art jailbreaking attacks lacks a direct side-by-side table or citation of the exact ASR values and prompt sets used for those baselines, making the claim that the low-resource translation approach is 'on par with or even surpassing' them difficult to evaluate rigorously.

minor comments (2)

[Abstract] The abstract states that high-/mid-resource languages have 'significantly lower' ASR but provides no numerical values or reference to a table/figure showing these rates for direct comparison.
[Introduction] Notation for attack success rate (ASR) and AdvBenchmark should be defined on first use with a brief description of the benchmark's composition.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity, reproducibility, and rigor of our manuscript. We address each major comment point by point below and commit to revisions that strengthen the presentation of our results without altering the core findings.

read point-by-point responses

Referee: [Abstract and Experimental Results] The abstract and results section report a 79% ASR on AdvBenchmark without specifying sample size (number of prompts or languages tested), selection criteria for low-resource languages, or any statistical tests (e.g., confidence intervals or significance against baselines), which are load-bearing for the claim that this rate demonstrates a true safety bypass rather than an artifact of the experimental setup.

Authors: We agree that these details are essential for rigorous evaluation and reproducibility. In the revised manuscript, we will explicitly report the sample size (number of prompts tested from AdvBenchmark), the criteria used to select the low-resource languages (based on established resource-level classifications from prior NLP literature), and include statistical measures such as confidence intervals around the 79% ASR along with significance tests relative to baselines. These additions will appear in both the abstract and the experimental results section. revision: yes
Referee: [Methodology] The methodology does not include translation quality metrics, back-translation checks, or human fidelity ratings for the machine-translated prompts; without these, it is impossible to confirm that the unsafe intent is preserved exactly, undermining the attribution of the 79% ASR to linguistic inequality in safety training rather than semantic drift or neutralization in low-resource translations.

Authors: We acknowledge that explicit verification of translation fidelity would strengthen the causal attribution to safety training imbalances. Although the experiments used publicly available translation APIs that generally preserve semantic intent, the original submission did not report quality checks. We will revise the methodology section to incorporate back-translation verification on a sampled subset of prompts, along with automated metrics (e.g., BLEU) and, where feasible, human fidelity ratings to confirm that unsafe intent remains intact. revision: yes
Referee: [Results and Discussion] The comparison to state-of-the-art jailbreaking attacks lacks a direct side-by-side table or citation of the exact ASR values and prompt sets used for those baselines, making the claim that the low-resource translation approach is 'on par with or even surpassing' them difficult to evaluate rigorously.

Authors: We will add a dedicated comparison table in the results section that directly lists our ASR alongside the exact values reported in the cited state-of-the-art jailbreaking papers, including the specific benchmarks or prompt sets employed in those works. The table will also note any differences in evaluation conditions to enable a more precise and transparent assessment of relative performance. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurement

full rationale

The paper reports an empirical attack success rate (79% on AdvBenchmark) obtained by translating English unsafe prompts via public APIs and querying GPT-4. No equations, fitted parameters, predictions, or first-principles derivations are present in the provided text. The central claim is a direct observation from model responses rather than a result that reduces to its own inputs by construction, self-citation, or renaming. The analysis is self-contained against external benchmarks (model queries) and does not rely on load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on experimental observations of model behavior under translated prompts; no free parameters, mathematical axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5502 in / 1057 out tokens · 71364 ms · 2026-05-17T09:12:43.135443+00:00 · methodology

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing
cs.LG 2026-05 unverdicted novelty 8.0

A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
Attention Is Where You Attack
cs.CR 2026-04 unverdicted novelty 7.0

ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
cs.CR 2026-04 unverdicted novelty 7.0

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
cs.CR 2026-05 unverdicted novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
cs.AI 2026-05 unverdicted novelty 6.0

An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
A Theoretical Game of Attacks via Compositional Skills
cs.CL 2026-05 unverdicted novelty 6.0

A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stro...
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
cs.CR 2026-04 unverdicted novelty 6.0

Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
cs.LG 2026-04 unverdicted novelty 6.0

COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
cs.CR 2026-04 unverdicted novelty 6.0

TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety
cs.LG 2026-04 unverdicted novelty 6.0

LASA improves LLM safety by aligning at language-agnostic semantic bottlenecks, reducing average ASR from 24.7% to 2.8% on LLaMA-3.1-8B and to 3-4% on Qwen models.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
cs.LG 2024-10 accept novelty 6.0

AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
cs.CR 2024-03 accept novelty 6.0

JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
A StrongREJECT for Empty Jailbreaks
cs.LG 2024-02 conditional novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
cs.LG 2023-10 accept novelty 6.0

SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 5.0

MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
cs.CR 2025-11 unverdicted novelty 5.0

ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
cs.CR 2024-07 accept novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 18 Pith papers · 13 internal anchors

[1]

Jigsaw multilingual toxic comment classification, 2020

Jigsaw/Conversation AI. Jigsaw multilingual toxic comment classification, 2020. https://www.kaggle.com/competitions/ jigsaw-multilingual-toxic-comment-classification , Last accessed on 2023-09-14

work page 2020
[2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Love- nia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023

work page internal anchor Pith review arXiv 2023
[5]

Building machine translation systems for the next thousand languages

Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, et al. Building machine translation systems for the next thousand languages. arXiv preprint arXiv:2205.03983, 2022

work page arXiv 2022
[6]

Seamlessm4t- massively multilingual & multimodal machine translation

Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t- massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596, 2023

work page arXiv 2023
[7]

Systematic inequalities in language technology performance across the world’s languages

Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. Systematic inequalities in language technology performance across the world’s languages. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10....

work page doi:10.18653/v1/2022.acl-long.376 2022
[8]

Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023

Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023. 7

work page arXiv 2023
[9]

Aim, 2023

Jailbreak Chat. Aim, 2023. https://www.jailbreakchat.com/prompt/ 4f37a029-9dff-4862-b323-c96a5504de5d , Last accessed on 2023-09-13

work page 2023
[10]

Translatorbot, 2023

Jailbreak Chat. Translatorbot, 2023. https://www.jailbreakchat.com/prompt/ 3e93895c-2542-4201-a297-aa8be2db8bd7 , Last accessed on 2023-09-11

work page 2023
[11]

How is chatgpt’ s behavior changing over time?, 2023

Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’ s behavior changing over time?, 2023

work page 2023
[12]

CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech

Yi-Ling Chung, Elizaveta Kuzmenko, Serra Sinem Tekiroglu, and Marco Guerini. CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech. In Proceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics, pages 2819–2829, Florence, Italy, July 2019. Association for Computatio...

work page doi:10.18653/v1/p19-1271 2019
[13]

Language support, 2023

Google Cloud. Language support, 2023. https://cloud.google.com/translate/docs/ languages, Last accessed on 2023-09-14

work page 2023
[14]

No Language Left Behind: Scaling Human-Centered Machine Translation

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Multilingual Jailbreak Challenges in Large Language Models

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. ArXiv, abs/2310.06474, 2023. URL https://api. semanticscholar.org/CorpusID:263831094

work page arXiv 2023
[16]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamil˙e Lukoši¯ut˙e, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noem...

work page arXiv 2023
[18]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Real- toxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[19]

Chatgpt perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across bengali and five other low-resource languages

Sourojit Ghosh and Aylin Caliskan. Chatgpt perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across bengali and five other low-resource languages. arXiv preprint arXiv:2305.10510, 2023

work page arXiv 2023
[20]

How good are gpt models at machine translation? a comprehensive evaluation.arXiv preprint arXiv:2302.09210, 2023

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. How good are gpt models at machine translation? a comprehensive evaluation.arXiv preprint arXiv:2302.09210, 2023

work page arXiv 2023
[21]

Cbbq: A chinese bias benchmark dataset curated with human-ai collaboration for large language models

Yufei Huang and Deyi Xiong. Cbbq: A chinese bias benchmark dataset curated with human-ai collaboration for large language models. arXiv preprint arXiv:2306.16244, 2023

work page arXiv 2023
[22]

Adversarial Examples for Evaluating Reading Comprehension Systems

Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, September 2017. Association for Com- putational Linguistics. doi: 10.18653/v1/D17-1215. URL https://aclanthology.org/ D17-1215. 8

work page doi:10.18653/v1/d17-1215 2017
[23]

Automatically auditing large language models via discrete optimization

Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. In Proceedings of the 40th International Conference on Machine Learning, ICML ’23. JMLR.org, 2023

work page 2023
[24]

doi: 10.18653/v1/ 2024.acl-long.702

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , 2020. doi: 10.18653/v1/ 2020.acl-main.560

work page doi:10.18653/v1/ 2020
[25]

Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning.arXiv preprint arXiv:2304.05613, 2023

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning.arXiv preprint arXiv:2304.05613, 2023

work page arXiv 2023
[26]

Open sesame! universal black box jailbreaking of large language models

Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023

work page arXiv 2023
[27]

Rain: Your language models can align themselves without finetuning

Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. Rain: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124, 2023

work page arXiv 2023
[30]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Black box adversarial prompting for foundation models

Natalie Maus, Patrick Chao, Eric Wong, and Jacob R Gardner. Black box adversarial prompting for foundation models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023

work page 2023
[32]

Mitigating harm in language models with conditional-likelihood filtration

Helen Ngo, Cooper Raterink, João GM Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, and Nicholas Frosst. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790, 2021

work page arXiv 2021
[33]

Duolingo, 2023

OpenAI. Duolingo, 2023. https://openai.com/customer-stories/duolingo, Last ac- cessed on 2023-09-14

work page 2023
[34]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Government of iceland, 2023

OpenAI. Government of iceland, 2023. https://openai.com/customer-stories/ government-of-iceland, Last accessed on 2023-09-14

work page 2023
[36]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022
[37]

Practical black-box attacks against machine learning

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Anan- thram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security , pages 506–519, 2017. 9

work page 2017
[38]

Red teaming language models with language models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates, December 2022. Association for Com...

work page doi:10.18653/v1/2022.emnlp-main.225 2022
[39]

Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold

Sebastian Ruder, Ivan Vuli´ c, and Anders Søgaard. Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2340–2354, Dublin, Ireland, May 2022. As- sociation for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.184. URL https://acl...

work page doi:10.18653/v1/2022.findings-acl.184 2022
[40]

Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models

Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, et al. Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149, 2023

work page arXiv 2023
[41]

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Why so toxic? measuring and triggering toxic behavior in open-domain chatbots

Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, and Yang Zhang. Why so toxic? measuring and triggering toxic behavior in open-domain chatbots. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 2659–2673, 2022

work page 2022
[43]

Universal adversarial attacks with natural triggers for text classification

Liwei Song, Xinwei Yu, Hsuan-Tung Peng, and Karthik Narasimhan. Universal adversarial attacks with natural triggers for text classification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3724–3733, Online, June 2021. Association for Computational Lin...

work page doi:10.18653/v1/2021.naacl-main.291 2021
[44]

Smith, and Luke Zettlemoyer

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy, July 2019. Association for Computational Lin- guistics. doi: 10.18653/v1/P19-1164. URL https://aclanthology.org/P19-1164

work page doi:10.18653/v1/p19-1164 2019
[45]

ChatGPT is not a good indigenous translator

David Stap and Ali Araabi. ChatGPT is not a good indigenous translator. In Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), pages 163–167, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.americasnlp-1.17. URL https://aclanthology.org/ 2023.americ...

work page doi:10.18653/v1/2023.americasnlp-1.17 2023
[46]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Translated unleashes full gpt-4 potential for businesses operating in languages other than english, 2023

Translated. Translated unleashes full gpt-4 potential for businesses operating in languages other than english, 2023. https://translated.com/t-lm-gpt-integration , Last ac- cessed on 2023-09-14

work page 2023
[48]

Universal adversarial triggers for attacking and analyzing NLP

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 2153–2162, Hong Kong, China, November

work page 2019
[49]

doi: 10.18653/v1/D19-1221

Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https: //aclanthology.org/D19-1221

work page doi:10.18653/v1/d19-1221
[50]

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen tse Huang, Wenxiang Jiao, and Michael R. Lyu. All languages matter: On the multilingual safety of large language mod- els. ArXiv, abs/2310.00905, 2023. URL https://api.semanticscholar.org/CorpusID: 263605466. 10

work page arXiv 2023
[51]

Do-not-answer: A dataset for evaluating safeguards in llms

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023

work page arXiv 2023
[52]

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[54]

Prompting large language models to generate code-mixed texts: The case of south east asian languages

Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Samuel Cahyawijaya, Holy Lovenia, Lintang Sutawika, Jan Christian Blaise Cruz, Long Phan, Yin Lin Tan, et al. Prompting large language models to generate code-mixed texts: The case of south east asian languages. arXiv preprint arXiv:2303.13592, 2023

work page arXiv 2023
[55]

Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023

work page arXiv 2023
[56]

Exploring ai ethics of chatgpt: A diagnostic analysis

Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867, pages 12–2, 2023

work page arXiv 2023
[57]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adver- sarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 11 A Language resource settings classification We classify the resource setting of a language using the taxonomy provided by Joshi et al. [24]. • Low-Resource: Languages that are cons...

work page internal anchor Pith review Pith/arXiv arXiv 2023