arxiv: 2606.09178 · v2 · pith:TDOUET7Pnew · submitted 2026-06-08 · 💻 cs.CL · cs.AI

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

Pith reviewed 2026-06-27 16:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual LLM safety · cultural adaptation · red-teaming · attack success rate · direct translation · East Asian languages · Southeast Asian languages

The pith

Direct translations of safety benchmarks underestimate LLM attack risks in Korean, Japanese, Thai and Khmer by an average of 9.3 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds paired datasets of direct-translation and culturally-adapted red-teaming prompts for four languages using 1:1 seed matching. It measures attack success rate and a cultural realism score on four open-source models. Culturally-adapted prompts raise attack success rates in every language-model pair tested, while direct translations produce inputs whose cultural depth scores stay near zero. The work concludes that translation alone cannot produce valid safety tests for multilingual models.

Core claim

Culturally-adapted prompts produce positive Delta-ASR across all 16 language-model combinations (mean +9.3 pp), and direct-translation evaluation underestimates risk in 44 of 48 category-language combinations. Direct-translation cultural depth (C3) scores average 0.17 out of 3.0 while adapted scores reach up to 2.51, showing that translation alone yields inputs systematically unlike those arising in real multicultural settings.

What carries the argument

Paired DT and CA datasets created by 1:1 seed matching, which allow head-to-head measurement of attack success rate and cultural realism across languages.
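
A minimal sketch of how such a head-to-head comparison could be computed, assuming Delta-ASR means the CA attack success rate minus the DT attack success rate for each language-model pair, in percentage points. The record layout, the function name `delta_asr`, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def delta_asr(records):
    """Return {(language, model): ASR_CA - ASR_DT} in percentage points.

    `records` is an iterable of (language, model, condition, success) tuples,
    where condition is "DT" or "CA" and success is truthy when the attack
    was judged successful. This record layout is an assumption.
    """
    # counts[(language, model)][condition] = [successes, attempts]
    counts = defaultdict(lambda: {"DT": [0, 0], "CA": [0, 0]})
    for language, model, condition, success in records:
        cell = counts[(language, model)][condition]
        cell[0] += 1 if success else 0
        cell[1] += 1

    deltas = {}
    for key, by_condition in counts.items():
        (dt_hits, dt_n), (ca_hits, ca_n) = by_condition["DT"], by_condition["CA"]
        if dt_n == 0 or ca_n == 0:
            continue  # skip cells that lack one side of the pair
        deltas[key] = 100.0 * ca_hits / ca_n - 100.0 * dt_hits / dt_n
    return deltas


# Toy data for illustration only; not results from the paper.
toy = [
    ("KO", "model_a", "DT", False), ("KO", "model_a", "DT", True),
    ("KO", "model_a", "CA", True), ("KO", "model_a", "CA", True),
    ("TH", "model_a", "DT", False), ("TH", "model_a", "DT", False),
    ("TH", "model_a", "CA", True), ("TH", "model_a", "CA", False),
]
deltas = delta_asr(toy)
mean_delta = mean(deltas.values())  # average gap across language-model pairs
```

Because every CA prompt shares a seed with one DT prompt, the two conditions have equal attempt counts per cell, so the difference isolates the effect of adaptation rather than of prompt mix.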

If this is right

  • Direct-translation benchmarks systematically underestimate model risk in non-English languages.
  • Threat-form distributions differ across languages, so a single translated template set cannot cover all relevant cases.
  • Valid multilingual safety evaluation requires explicit adaptation to local social norms and legal frameworks rather than surface-level translation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety testing pipelines that rely only on English-to-target translation will continue to miss context-dependent vulnerabilities until cultural adaptation steps are added.
  • The observed heterogeneity in threat forms suggests that language-specific red-teaming taxonomies may be needed beyond the four languages studied here.
  • Model developers could use the same seed-matching approach to generate training data that better reflects real-world multicultural usage.

Load-bearing premise

The 1:1 seed matching procedure accurately captures language-specific cultural contexts, threat forms, social norms, and legal frameworks without systematic bias or loss of validity.

What would settle it

An experiment in which culturally-adapted prompts fail to raise attack success rates above direct-translation baselines or in which direct-translation cultural depth scores equal those of adapted prompts.
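
One hedged way to make the first half of that check concrete is a paired percentile bootstrap over seeds: if the resulting interval for the CA minus DT gap includes zero, the data do not show that adaptation raises attack success. This is a sketch under stated assumptions, not the paper's analysis; the function name `paired_bootstrap_ci`, the `(dt_success, ca_success)` pair layout, and the default settings are hypothetical.

```python
import random


def paired_bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(CA - DT) in percentage points.

    `pairs` is a list of (dt_success, ca_success) values in {0, 1}, one per
    matched seed (an assumed layout). Seeds are resampled with replacement so
    that each DT outcome stays attached to its CA counterpart.
    """
    if not pairs:
        raise ValueError("need at least one matched pair")
    rng = random.Random(seed)
    n = len(pairs)
    per_seed = [100.0 * (ca - dt) for dt, ca in pairs]
    point = sum(per_seed) / n

    stats = []
    for _ in range(n_resamples):
        sample = [per_seed[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()

    lo_index = int((alpha / 2) * n_resamples)
    hi_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, stats[lo_index], stats[hi_index]
```

Resampling whole pairs rather than each condition separately preserves the 1:1 seed matching, which is the property the comparison depends on.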

Figures

Figures reproduced from arXiv: 2606.09178 by Hyeji Choi, Minwoo Kim, YongTaek Lim.

Figure 1: Overview of the study. (a) Direct translation preserves surface form but strips away cultural context. (b) The culturally-adapted pipeline separates attack intent from localized cultural content, generating one culturally grounded prompt per language (Kim et al., 2026). (c) Cultural Realism is evaluated along three axes: plausibility (C1), taxonomy adherence (C2), and cultural depth (C3).
Abstract

Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLM. CA prompts yield Delta-ASR > 0 across all 16 language x model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category x language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that direct translation (DT) of English red-teaming benchmarks fails to capture cultural contexts in Korean, Japanese, Thai, and Khmer, leading to underestimation of LLM risks. Using 1:1 seed matching to create paired culturally-adapted (CA) datasets, it reports Delta-ASR > 0 across all 16 language-model combinations (mean +9.3 pp), DT underestimation in 44/48 category-language cases, and Cultural Realism (C3) scores of mean 0.17 for DT versus up to 2.51 for CA, concluding that cultural adaptation is necessary for valid multilingual safety evaluation.

Significance. If the CA construction is validated, the result would be significant for multilingual LLM safety research by providing concrete evidence that DT methods produce systematically divergent inputs and lower measured risk. The cross-language comparison and introduction of Cultural Realism scoring could influence benchmark design practices.

major comments (3)
  1. [Methods (1:1 seed matching procedure)] Dataset construction via 1:1 seed matching: No validation is reported for cultural fidelity (e.g., inter-annotator agreement, native-speaker blind assessment, or bias checks against over/under-adaptation). This is load-bearing for the central claim, as the observed ASR gaps and C3 differences could reflect construction artifacts rather than genuine language-specific norms, threat forms, or legal contexts.
  2. [Abstract and Results] Abstract and results reporting: Aggregate statistics (Delta-ASR mean +9.3 pp, 44/48 underestimation) are presented without dataset sizes, category definitions, statistical tests, error bars, model versions, or inter-rater reliability for Cultural Realism scores. These details are required to evaluate whether the headline comparisons are robust.
  3. [Evaluation Metrics] Cultural Realism (C3) metric: The scale (0-3.0), rubric, and application protocol across languages are not described, undermining the claim that DT scores (mean 0.17) indicate inputs 'systematically divergent from those encountered in real-world multicultural settings.'
minor comments (1)
  1. [Abstract] Abstract: 'four open-source LLM' should read 'four open-source LLMs'.
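
The inter-annotator agreement and inter-rater reliability requested in the major comments could be reported with a chance-corrected statistic. A minimal sketch using unweighted Cohen's kappa, assuming two raters each assign an integer level from 0 to 3 to the same prompts. The function name `cohens_kappa` and the integer-level assumption are illustrative; the paper's actual rubric is not described in this review, and a weighted variant may suit an ordinal scale better.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if both raters labelled independently at their
    # own marginal rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates the raters agree no more often than their label frequencies alone would predict.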

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate planned revisions.

  1. Referee: [Methods (1:1 seed matching procedure)] Dataset construction via 1:1 seed matching: No validation is reported for cultural fidelity (e.g., inter-annotator agreement, native-speaker blind assessment, or bias checks against over/under-adaptation). This is load-bearing for the central claim, as the observed ASR gaps and C3 differences could reflect construction artifacts rather than genuine language-specific norms, threat forms, or legal contexts.

    Authors: The 1:1 seed matching was executed by native-speaker teams for each language using a shared protocol to incorporate local norms and contexts. The manuscript does not report inter-annotator agreement or blind assessments. We will expand the Methods section with a full description of the protocol and any bias checks performed during construction, and we will explicitly note the absence of formal inter-rater metrics as a limitation. revision: yes

  2. Referee: [Abstract and Results] Abstract and results reporting: Aggregate statistics (Delta-ASR mean +9.3 pp, 44/48 underestimation) are presented without dataset sizes, category definitions, statistical tests, error bars, model versions, or inter-rater reliability for Cultural Realism scores. These details are required to evaluate whether the headline comparisons are robust.

    Authors: Dataset sizes, category definitions, and model versions appear in Sections 3 and 4 and Table 1. Statistical tests for ASR differences are reported in the results. Error bars and inter-rater reliability for C3 scores are not included. We will add error bars to aggregate statistics and report inter-rater reliability (or its absence) in the revised results section. revision: yes

  3. Referee: [Evaluation Metrics] Cultural Realism (C3) metric: The scale (0-3.0), rubric, and application protocol across languages are not described, undermining the claim that DT scores (mean 0.17) indicate inputs 'systematically divergent from those encountered in real-world multicultural settings.'

    Authors: The 0-3.0 scale, rubric, and native-speaker rating protocol are defined in Section 4.3 and Appendix B. We will elevate the rubric and protocol description into the main text to make the metric fully transparent and to support the interpretation of the DT versus CA scores. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical dataset comparison

The paper reports direct empirical comparisons of ASR and Cultural Realism between DT and CA datasets constructed via 1:1 seed matching. No equations, fitted parameters, predictions derived from fits, or load-bearing self-citations appear in the provided text. The central claims (Delta-ASR > 0, DT underestimation in 44/48 cases) are presented as observed outcomes rather than derivations that reduce to the construction method by definition. The 1:1 matching procedure is an input assumption whose validity is external to any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on the unverified validity of the cultural adaptation construction method and the Cultural Realism scoring rubric; these are domain assumptions with no external grounding or falsification test described in the abstract.

axioms (2)
  • domain assumption: 1:1 seed matching produces culturally valid and comparable threat scenarios across languages
    Invoked to justify the superiority of CA over DT datasets.
  • domain assumption: Cultural Realism scores (0-3 scale) validly measure divergence from real-world multicultural inputs
    Used to quantify that DT scores (mean 0.17) are inadequate.

pith-pipeline@v0.9.1-grok · 5779 in / 1504 out tokens · 32905 ms · 2026-06-27T16:47:07.784822+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    Constitutional AI: Harmlessness from AI Feedback

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

  2. [2]

    Multilingual jailbreak challenges in large language models

    Deng, Y., Zhang, W., Pan, S. J., and Bing, L. Multilingual jailbreak challenges in large language models. In International Conference on Learning Representations, volume 2024, pp. 24634–24651.

  3. [3]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.

  4. [4]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  5. [5]

    Cage: A framework for culturally adaptive red-teaming benchmark generation

    Kim, C., Lim, Y., Kim, K., Kim, J., and Kim, M. Cage: A framework for culturally adaptive red-teaming benchmark generation. arXiv preprint arXiv:2602.20170.

  6. [6]

    Exaone 3.5: Series of large language models for real-world use cases

    Technical report; arXiv:2412.04862.

    Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., and Shao, J. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 3923–3954.

  7. [7]

    Seacrowd: A multilingual multimodal data hub and benchmark suite for southeast asian languages

    Lovenia, H., Mahendra, R., Akbar, S. M., Miranda, L. J. V., Santoso, J., Aco, E., Fadhilah, A., Mansurov, J., Imperial, J. M., Kampman, O. P., et al. Seacrowd: A multilingual multimodal data hub and benchmark suite for southeast asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5155–5203.

  8. [8]

    Red teaming language models with language models

    Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3419–3448.

  9. [9]

    Qwen2.5 Technical Report

    Technical report; arXiv:2412.15115.

    Ranathunga, S., Lee, E.-S. A., Prifti Skenduli, M., Shekhar, R., Alam, M., and Kaur, R. Neural machine translation for low-resource languages: A survey. ACM Computing Surveys, 55(11):1–37.

  10. [10]

    Röttger, P., Vidgen, B., Nguyen, D., Waseem, Z., Margetts, H., and Pierrehumbert, J. B. Hatecheck: Functional tests for hate speech detection models. In ACL-IJCNLP 2021.

  11. [11]

    Culturebank: An online community-driven knowledge base towards culturally aware language technologies

    Shi, W., Li, R., Zhang, Y., Ziems, C., Yu, S., Horesh, R., De Paula, R. A., and Yang, D. Culturebank: An online community-driven knowledge base towards culturally aware language technologies. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 4996–5025.

  12. [12]

    Safety assessment of Chinese large language models

    Sun, H., Zhang, Z., Deng, J., Cheng, J., and Huang, M. Safety assessment of Chinese large language models. arXiv preprint arXiv:2304.10436.

  13. [13]

    Alert: A comprehensive benchmark for assessing large language models’ safety through red teaming

    Tedeschi, S., Friedrich, F., Schramowski, P., Kersting, K., Navigli, R., Nguyen, H., and Li, B. ALERT: A comprehensive benchmark for assessing large language models’ safety through red teaming. arXiv preprint arXiv:2404.08676.

  14. [14]

    Wang, W., Tu, Z., Chen, C., Yuan, Y., Huang, J.-t., Jiao, W., and Lyu, M. R. All languages matter: On the multilingual safety of large language models. arXiv preprint arXiv:2310.00905, 2023a.

    Wang, Y., Li, H., Han, X., Nakov, P., and Baldwin, T. Do-not-answer: A dataset for evaluating safegua...

  15. [15]

    Yong, Z.-X., Menghini, C., and Bach, S. H. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.

  16. [16]

    The state of multilingual llm safety research: From measuring the language gap to mitigating it

    Yong, Z.-X., Ermis, B., Fadaee, M., Bach, S., and Kreutzer, J. The state of multilingual llm safety research: From measuring the language gap to mitigating it. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 15856–15871.

  17. [17]

    Air-bench 2024: A safety benchmark based on risk categories from regulations and policies

    Zeng, Y., Yang, Y., Zhou, A., Tan, J. Z., Tu, Y., Mai, Y., Klyman, K., Pan, M., Jia, R., Song, D., et al. Air-bench 2024: A safety benchmark based on risk categories from regulations and policies. arXiv preprint arXiv:2407.17436.