Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis
Pith reviewed 2026-06-27 16:47 UTC · model grok-4.3
The pith
Direct translations of safety benchmarks underestimate LLM attack risks in Korean, Japanese, Thai and Khmer by an average of 9.3 percentage points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Culturally-adapted prompts produce positive Delta-ASR across all 16 language-model combinations (mean +9.3 pp) and direct-translation evaluation underestimates risk in 44 of 48 category-language combinations; direct-translation cultural depth scores average 0.17 out of 3.0 while adapted scores reach 2.51, showing that translation alone yields inputs systematically unlike those arising in real multicultural settings.
What carries the argument
Paired DT and CA datasets created by 1:1 seed matching, which allow head-to-head measurement of attack success rate and cultural realism across languages.
If this is right
- Direct-translation benchmarks systematically underestimate model risk in non-English languages.
- Threat-form distributions differ across languages, so a single translated template set cannot cover all relevant cases.
- Valid multilingual safety evaluation requires explicit adaptation to local social norms and legal frameworks rather than surface-level translation.
Where Pith is reading between the lines
- Safety testing pipelines that rely only on English-to-target translation will continue to miss context-dependent vulnerabilities until cultural adaptation steps are added.
- The observed heterogeneity in threat forms suggests that language-specific red-teaming taxonomies may be needed beyond the four languages studied here.
- Model developers could use the same seed-matching approach to generate training data that better reflects real-world multicultural usage.
Load-bearing premise
The 1:1 seed matching procedure accurately captures language-specific cultural contexts, threat forms, social norms, and legal frameworks without systematic bias or loss of validity.
What would settle it
An experiment in which culturally-adapted prompts fail to raise attack success rates above direct-translation baselines or in which direct-translation cultural depth scores equal those of adapted prompts.
Figures
read the original abstract
Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLM. CA prompts yield Delta-ASR > 0 across all 16 language x model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category x language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that direct translation (DT) of English red-teaming benchmarks fails to capture cultural contexts in Korean, Japanese, Thai, and Khmer, leading to underestimation of LLM risks. Using 1:1 seed matching to create paired culturally-adapted (CA) datasets, it reports Delta-ASR > 0 across all 16 language-model combinations (mean +9.3 pp), DT underestimation in 44/48 category-language cases, and Cultural Realism (C3) scores of mean 0.17 for DT versus up to 2.51 for CA, concluding that cultural adaptation is necessary for valid multilingual safety evaluation.
Significance. If the CA construction is validated, the result would be significant for multilingual LLM safety research by providing concrete evidence that DT methods produce systematically divergent inputs and lower measured risk. The cross-language comparison and introduction of Cultural Realism scoring could influence benchmark design practices.
major comments (3)
- [Methods (1:1 seed matching procedure)] Dataset construction via 1:1 seed matching: No validation is reported for cultural fidelity (e.g., inter-annotator agreement, native-speaker blind assessment, or bias checks against over/under-adaptation). This is load-bearing for the central claim, as the observed ASR gaps and C3 differences could reflect construction artifacts rather than genuine language-specific norms, threat forms, or legal contexts.
- [Abstract and Results] Abstract and results reporting: Aggregate statistics (Delta-ASR mean +9.3 pp, 44/48 underestimation) are presented without dataset sizes, category definitions, statistical tests, error bars, model versions, or inter-rater reliability for Cultural Realism scores. These details are required to evaluate whether the headline comparisons are robust.
- [Evaluation Metrics] Cultural Realism (C3) metric: The scale (0-3.0), rubric, and application protocol across languages are not described, undermining the claim that DT scores (mean 0.17) indicate inputs 'systematically divergent from those encountered in real-world multicultural settings.'
minor comments (1)
- [Abstract] Abstract: 'four open-source LLM' should read 'four open-source LLMs'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate planned revisions.
read point-by-point responses
-
Referee: [Methods (1:1 seed matching procedure)] Dataset construction via 1:1 seed matching: No validation is reported for cultural fidelity (e.g., inter-annotator agreement, native-speaker blind assessment, or bias checks against over/under-adaptation). This is load-bearing for the central claim, as the observed ASR gaps and C3 differences could reflect construction artifacts rather than genuine language-specific norms, threat forms, or legal contexts.
Authors: The 1:1 seed matching was executed by native-speaker teams for each language using a shared protocol to incorporate local norms and contexts. The manuscript does not report inter-annotator agreement or blind assessments. We will expand the Methods section with a full description of the protocol and any bias checks performed during construction, and we will explicitly note the absence of formal inter-rater metrics as a limitation. revision: yes
-
Referee: [Abstract and Results] Abstract and results reporting: Aggregate statistics (Delta-ASR mean +9.3 pp, 44/48 underestimation) are presented without dataset sizes, category definitions, statistical tests, error bars, model versions, or inter-rater reliability for Cultural Realism scores. These details are required to evaluate whether the headline comparisons are robust.
Authors: Dataset sizes, category definitions, and model versions appear in Sections 3 and 4 and Table 1. Statistical tests for ASR differences are reported in the results. Error bars and inter-rater reliability for C3 scores are not included. We will add error bars to aggregate statistics and report inter-rater reliability (or its absence) in the revised results section. revision: yes
-
Referee: [Evaluation Metrics] Cultural Realism (C3) metric: The scale (0-3.0), rubric, and application protocol across languages are not described, undermining the claim that DT scores (mean 0.17) indicate inputs 'systematically divergent from those encountered in real-world multicultural settings.'
Authors: The 0-3.0 scale, rubric, and native-speaker rating protocol are defined in Section 4.3 and Appendix B. We will elevate the rubric and protocol description into the main text to make the metric fully transparent and to support the interpretation of the DT versus CA scores. revision: yes
Circularity Check
No circularity: purely empirical dataset comparison
full rationale
The paper reports direct empirical comparisons of ASR and Cultural Realism between DT and CA datasets constructed via 1:1 seed matching. No equations, fitted parameters, predictions derived from fits, or load-bearing self-citations appear in the provided text. The central claims (Delta-ASR > 0, DT underestimation in 44/48 cases) are presented as observed outcomes rather than derivations that reduce to the construction method by definition. The 1:1 matching procedure is an input assumption whose validity is external to any circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption 1:1 seed matching produces culturally valid and comparable threat scenarios across languages
- domain assumption Cultural Realism scores (0-3 scale) validly measure divergence from real-world multicultural inputs
Reference graph
Works this paper leans on
-
[1]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y ., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKin- non, C., et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
J., and Bing, L
Deng, Y ., Zhang, W., Pan, S. J., and Bing, L. Multilingual jailbreak challenges in large language models. InInterna- tional Conference on Learning Representations, volume 2024, pp. 24634–24651,
2024
-
[3]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y ., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Kim, C., Lim, Y ., Kim, K., Kim, J., and Kim, M. Cage: A framework for culturally adaptive red-teaming benchmark generation.arXiv preprint arXiv:2602.20170,
-
[6]
Exaone 3.5: Series of large language models for real-world use cases,
Technical report; arXiv:2412.04862. Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y ., and Shao, J. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language mod- els. InFindings of the Association for Computational Linguistics: ACL 2024, pp. 3923–3954,
-
[7]
M., Miranda, L
Lovenia, H., Mahendra, R., Akbar, S. M., Miranda, L. J. V ., Santoso, J., Aco, E., Fadhilah, A., Mansurov, J., Imperial, J. M., Kampman, O. P., et al. Seacrowd: A multilingual multimodal data hub and benchmark suite for southeast asian languages. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5155–5203,
2024
-
[8]
Red teaming language models with language models
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. Red teaming language models with language models. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3419–3448,
2022
-
[9]
Technical report; arXiv:2412.15115. Ranathunga, S., Lee, E.-S. A., Prifti Skenduli, M., Shekhar, R., Alam, M., and Kaur, R. Neural machine translation for low-resource languages: A survey.ACM Computing Surveys, 55(11):1–37,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Röttger, P., Vidgen, B., Nguyen, D., Waseem, Z., Margetts, H., and Pierrehumbert, J. B. Hatecheck: Functional tests for hate speech detection models. InACL-IJCNLP 2021,
2021
-
[11]
A., and Yang, D
Shi, W., Li, R., Zhang, Y ., Ziems, C., Yu, S., Horesh, R., De Paula, R. A., and Yang, D. Culturebank: An on- line community-driven knowledge base towards culturally aware language technologies. InFindings of the Associ- ation for Computational Linguistics: EMNLP 2024, pp. 4996–5025,
2024
-
[12]
Safety assessment of Chinese large language models
Sun, H., Zhang, Z., Deng, J., Cheng, J., and Huang, M. Safety assessment of Chinese large language models. arXiv preprint arXiv:2304.10436,
-
[13]
Alert: A comprehensive benchmark for assessing large language models’ safety through red teaming,
Tedeschi, S., Friedrich, F., Schramowski, P., Kersting, K., Navigli, R., Nguyen, H., and Li, B. ALERT: A comprehensive benchmark for assessing large language models’ safety through red teaming.arXiv preprint arXiv:2404.08676,
-
[14]
Wang, W., Tu, Z., Chen, C., Yuan, Y ., Huang, J.-t., Jiao, W., and Lyu, M. R. All languages matter: On the multi- lingual safety of large language models.arXiv preprint arXiv:2310.00905, 2023a. 7 Culturally-Adapted Red-Teaming in East and Southeast Asia Wang, Y ., Li, H., Han, X., Nakov, P., and Baldwin, T. Do- not-answer: A dataset for evaluating safegua...
-
[15]
Yong, Z.-X., Menghini, C., and Bach, S. H. Low- resource languages jailbreak GPT-4.arXiv preprint arXiv:2310.02446,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
The state of multilingual llm safety research: From measuring the language gap to mitigating it
Yong, Z.-X., Ermis, B., Fadaee, M., Bach, S., and Kreutzer, J. The state of multilingual llm safety research: From measuring the language gap to mitigating it. InProceed- ings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 15856–15871,
2025
-
[17]
Air-bench 2024: A safety benchmark based on risk categories from regulations and policies,
Zeng, Y ., Yang, Y ., Zhou, A., Tan, J. Z., Tu, Y ., Mai, Y ., Kly- man, K., Pan, M., Jia, R., Song, D., et al. Air-bench 2024: A safety benchmark based on risk categories from regu- lations and policies.arXiv preprint arXiv:2407.17436,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.