IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia
Pith reviewed 2026-05-21 10:27 UTC · model grok-4.3
The pith
LLM safety alignments do not transfer evenly across 12 Indic languages spoken by over a billion people.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety alignment in multilingual LLMs exhibits significant drift across Indic languages, with cross-language agreement on safety judgments at only 12.8% and variance in safe response rates exceeding 17%. Some models over-refuse benign prompts in low-resource scripts while others under-flag unsafe content on sensitive topics, quantified through entropy, bias scores, and consistency indices.
What carries the argument
The IndicSafe benchmark, a set of 6,000 culturally grounded prompts translated into 12 Indic languages, together with metrics for prompt-level entropy, category bias scores, and multilingual consistency indices to measure safety drift.
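The text here names the metrics but does not define them, so the following is only a minimal sketch of one plausible reading, not the paper's formulas. It assumes each prompt receives one label per language, treats prompt-level entropy as the Shannon entropy of those labels, and treats cross-language agreement as the share of prompts labeled identically in every language. The label names other than SAFE, the language codes, and the toy table are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: prompt id -> {language code: safety label}.
# None of these values come from the paper.
TOY_LABELS = {
    "p1": {"hi": "SAFE", "ta": "SAFE", "bn": "SAFE"},
    "p2": {"hi": "SAFE", "ta": "UNSAFE", "bn": "UNSAFE"},
    "p3": {"hi": "REFUSE", "ta": "SAFE", "bn": "UNSAFE"},
    "p4": {"hi": "UNSAFE", "ta": "UNSAFE", "bn": "UNSAFE"},
}


def prompt_entropy(labels):
    """Shannon entropy in bits of one prompt's labels across languages."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in probs)


def full_agreement_rate(label_table):
    """Fraction of prompts that receive the same label in every language."""
    agreeing = sum(
        1 for by_lang in label_table.values() if len(set(by_lang.values())) == 1
    )
    return agreeing / len(label_table)


def safe_rate_by_language(label_table):
    """Share of prompts labeled SAFE, computed separately per language."""
    totals, safe = Counter(), Counter()
    for by_lang in label_table.values():
        for lang, label in by_lang.items():
            totals[lang] += 1
            safe[lang] += label == "SAFE"
    return {lang: safe[lang] / totals[lang] for lang in totals}


if __name__ == "__main__":
    for prompt_id, by_lang in TOY_LABELS.items():
        print(prompt_id, round(prompt_entropy(list(by_lang.values())), 3))
    print("agreement:", full_agreement_rate(TOY_LABELS))
    print("SAFE rates:", safe_rate_by_language(TOY_LABELS))
```

Under this reading, an entropy of zero means a prompt is judged the same way in every language, and higher values indicate more drift. Strict all-language agreement is a demanding criterion, which may be one reason a figure such as 12.8% can look low; a pairwise definition would give different numbers.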
If this is right
- Safety evaluations for LLMs must include language-specific testing rather than relying on English-centric results.
- Deployments in South Asian regions require adjustments to handle varying refusal behaviors across scripts and topics.
- Language-aware alignment techniques are necessary to reduce generalization gaps in safety training.
- Models may need separate handling for politically sensitive or culturally specific content per language.
Where Pith is reading between the lines
- Imbalanced training data likely contributes to the observed safety inconsistencies across languages.
- The benchmark could be extended to measure improvements after targeted fine-tuning on Indic data.
- Similar evaluations in other low-resource language families might show comparable safety transfer issues.
- Users in Indic regions face uneven protection from model harms depending on the language used.
Load-bearing premise
The 6,000 prompts stay culturally accurate and relevant to safety after being translated into the 12 languages, and the refusal criteria truly reflect model behavior without translation errors or evaluator bias.
What would settle it
Re-running the evaluations with human-verified translations and independent safety labeling that finds consistent safety rates across languages would challenge the claim of significant drift.
Original abstract
As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts, overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IndicSafe, the first benchmark for evaluating LLM safety in 12 Indic languages using 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics. It evaluates 10 leading LLMs on translated prompt variants and reports significant safety drift, with cross-language agreement at only 12.8% and SAFE rate variance exceeding 17% across languages. The work concludes that safety alignment does not transfer evenly and releases the benchmark to support language-aware alignment strategies.
Significance. If the core methodology holds, this benchmark is significant for addressing an important gap in multilingual LLM safety for low-resource languages spoken by over 1.2 billion people. The release of IndicSafe, combined with metrics such as prompt-level entropy, category bias scores, and multilingual consistency indices, provides a concrete resource for future work on regional harms and could influence alignment practices in Indic deployments.
major comments (2)
- [Abstract] Abstract: The central claim of significant safety drift (12.8% cross-language agreement and >17% SAFE rate variance) and the conclusion that safety alignment does not transfer evenly are load-bearing for the paper. These rest on the assumption that the 6,000 prompts retain identical safety-relevant intent after translation into 12 languages, yet the abstract provides no details on translation validation, back-translation accuracy, native-speaker fidelity scores, prompt selection criteria, or inter-annotator agreement for safety labels.
- [Evaluation] Evaluation section: The description of refusal and safety labeling criteria lacks specificity on how consistency is ensured across languages and scripts; without this, it is unclear whether the reported over-refusal in low-resource scripts and failures to flag unsafe content reflect model behavior or labeling/translation artifacts.
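One common way to report the inter-annotator agreement asked for above is Cohen's kappa, computed separately for each language so that labeling consistency can be compared across scripts. The sketch below is an illustration under that assumption; the report does not say which statistic the authors use, and the function name and example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels from two annotators for one language.
    annotator_1 = ["SAFE", "SAFE", "UNSAFE", "UNSAFE"]
    annotator_2 = ["SAFE", "UNSAFE", "UNSAFE", "UNSAFE"]
    print(cohens_kappa(annotator_1, annotator_2))
```

If kappa turned out to be markedly lower for some languages than for others, part of the apparent safety drift could stem from labeling noise rather than model behavior, which is the distinction the referee is asking the authors to rule out.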
minor comments (2)
- The abstract mentions but does not define prompt-level entropy, category bias scores, and multilingual consistency indices; add explicit definitions or equations in the methods or results section.
- Consider including confidence intervals or statistical tests alongside the reported variance and agreement figures to strengthen the quantitative claims.
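As an illustration of the second suggestion, a percentile bootstrap is one simple way to attach a confidence interval to a per-language SAFE rate. This is a sketch under assumptions: the 0/1 outcome encoding, the function name, the resample count, and the example data are hypothetical and are not taken from the paper.

```python
import random


def bootstrap_safe_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of 0/1 outcomes.

    outcomes: list with 1 where a response was labeled SAFE and 0 otherwise.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical language with 70 SAFE responses out of 100 prompts.
    outcomes = [1] * 70 + [0] * 30
    print(bootstrap_safe_rate_ci(outcomes))
```

Reporting such intervals per language would show whether differences in SAFE rates exceed what resampling noise alone would produce.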
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. These highlight important areas for clarifying our methodology on translation and labeling, which we address point by point below. We have revised the manuscript to incorporate the suggested improvements.
- Referee: [Abstract] Abstract: The central claim of significant safety drift (12.8% cross-language agreement and >17% SAFE rate variance) and the conclusion that safety alignment does not transfer evenly are load-bearing for the paper. These rest on the assumption that the 6,000 prompts retain identical safety-relevant intent after translation into 12 languages, yet the abstract provides no details on translation validation, back-translation accuracy, native-speaker fidelity scores, prompt selection criteria, or inter-annotator agreement for safety labels.
  Authors: We agree that the abstract, due to length constraints, does not include these details. The full manuscript describes prompt selection criteria by native experts for cultural grounding in Section 3, along with the translation process and validation steps. Inter-annotator agreement for safety labels is reported in the appendix. We will revise the abstract to briefly reference the native-speaker translation validation and prompt criteria to better support the central claims.
  Revision: yes
- Referee: [Evaluation] Evaluation section: The description of refusal and safety labeling criteria lacks specificity on how consistency is ensured across languages and scripts; without this, it is unclear whether the reported over-refusal in low-resource scripts and failures to flag unsafe content reflect model behavior or labeling/translation artifacts.
  Authors: We acknowledge this observation and agree that greater specificity on cross-language consistency would strengthen the section. The manuscript outlines refusal detection via standardized criteria and human review for a subset of cases. We will revise the Evaluation section to add explicit details on language-specific handling, script considerations, and steps taken to check for translation artifacts, such as parallel reviews of prompt variants.
  Revision: yes
Circularity Check
No circularity: purely empirical benchmark with direct measurements
This is a standard empirical benchmark paper that constructs a dataset of 6,000 culturally grounded prompts, translates them into 12 Indic languages, evaluates 10 LLMs for safety refusals, and reports observable statistics such as 12.8% cross-language agreement and >17% SAFE rate variance. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on direct model outputs and prompt-level metrics rather than any reduction to inputs by construction. The study is self-contained against external benchmarks because the reported agreement and variance figures are computed from fresh evaluations without self-referential definitions or load-bearing prior results from the same authors.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Safety of LLM outputs can be reliably measured by refusal rates and category-specific flagging on translated prompts.
- domain assumption: Translations of English safety prompts preserve cultural meaning and intent in Indic languages.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We present the first systematic evaluation of LLM safety across 12 Indic languages... cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions
  Introduces Indi-RomCoM benchmark for evaluating LLMs on Romanized code-mixed Indic-English instructions across seven tasks, four languages, and three mixing levels.
- Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
  LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.