arxiv: 2603.17915 · v2 · pith:WNKU3MQNnew · submitted 2026-03-18 · 💻 cs.CL · cs.AI

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Pith reviewed 2026-05-21 10:27 UTC · model grok-4.3

keywords LLM safety · multilingual LLMs · Indic languages · safety benchmark · safety alignment · cultural prompts · South Asia · cross-lingual evaluation

The pith

LLM safety alignments do not transfer evenly across 12 Indic languages spoken by over a billion people.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark called IndicSafe to test how well large language models maintain safety when prompts are translated into 12 Indic languages. It uses 6,000 culturally specific prompts on topics such as caste, religion, gender, health, and politics. Across 10 models, cross-language agreement on safety judgments is just 12.8 percent, and the variance in SAFE rates across languages exceeds 17 percent. This indicates that current safety training does not behave the same way in every language, particularly in languages underrepresented in training data. The benchmark is released to support language-specific safety methods for real-world use in South Asia.

Core claim

Safety alignment in multilingual LLMs exhibits significant drift across Indic languages, with cross-language agreement on safety judgments at only 12.8% and variance in safe response rates exceeding 17%. Some models over-refuse benign prompts in low-resource scripts while others under-flag unsafe content on sensitive topics, quantified through entropy, bias scores, and consistency indices.
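
To make the "SAFE rate variance" figure concrete, here is a minimal sketch of how per-language SAFE rates and their spread could be computed. The paper's exact definition is not given on this page, so the sketch reports both the range and the population variance of the per-language rates; the language codes, the `SAFE`/`UNSAFE` label strings, and the toy data are all hypothetical.

```python
from statistics import pvariance

# Hypothetical toy data: labels[language] holds one safety judgment per prompt.
labels = {
    "hi": ["SAFE", "SAFE", "UNSAFE", "SAFE"],
    "ta": ["SAFE", "UNSAFE", "UNSAFE", "SAFE"],
    "bn": ["SAFE", "SAFE", "SAFE", "SAFE"],
}

def safe_rate(judgments):
    """Fraction of judgments labeled SAFE."""
    return sum(j == "SAFE" for j in judgments) / len(judgments)

def safe_rate_spread(labels_by_language):
    """Return per-language SAFE rates, their range, and their population variance."""
    rates = {lang: safe_rate(j) for lang, j in labels_by_language.items()}
    values = list(rates.values())
    return rates, max(values) - min(values), pvariance(values)

rates, rate_range, rate_variance = safe_rate_spread(labels)
print(rates, rate_range, rate_variance)
```

Range and variance can differ by an order of magnitude on the same data, so which one a headline "17%" refers to matters when comparing results.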

What carries the argument

The IndicSafe benchmark, a set of 6,000 culturally grounded prompts translated into 12 Indic languages, together with metrics for prompt-level entropy, category bias scores, and multilingual consistency indices to measure safety drift.
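
The metrics named above are not defined on this page, so the following is only an assumed reading: prompt-level entropy as the Shannon entropy of the label distribution a single prompt receives across languages, and cross-language agreement as the fraction of prompts whose label is identical in every language. The prompt IDs, language codes, and label set (`SAFE`, `UNSAFE`, `REFUSAL`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: judgments[prompt_id][language] = safety label.
judgments = {
    "p1": {"hi": "SAFE", "ta": "SAFE", "bn": "SAFE"},
    "p2": {"hi": "SAFE", "ta": "UNSAFE", "bn": "REFUSAL"},
    "p3": {"hi": "UNSAFE", "ta": "UNSAFE", "bn": "SAFE"},
}

def prompt_entropy(labels_by_language):
    """Shannon entropy (bits) of one prompt's labels across languages."""
    counts = Counter(labels_by_language.values())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def cross_language_agreement(all_judgments):
    """Fraction of prompts that receive the same label in every language."""
    unanimous = sum(
        len(set(per_lang.values())) == 1 for per_lang in all_judgments.values()
    )
    return unanimous / len(all_judgments)

for prompt_id, per_lang in judgments.items():
    print(prompt_id, round(prompt_entropy(per_lang), 3))
print("agreement:", cross_language_agreement(judgments))
```

Under this reading, a prompt with unanimous labels has zero entropy, and entropy is maximal when every language yields a different label.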

If this is right

  • Safety evaluations for LLMs must include language-specific testing rather than relying on English-centric results.
  • Deployments in South Asian regions require adjustments to handle varying refusal behaviors across scripts and topics.
  • Language-aware alignment techniques are necessary to reduce generalization gaps in safety training.
  • Models may need separate handling for politically sensitive or culturally specific content per language.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Imbalanced training data likely contributes to the observed safety inconsistencies across languages.
  • The benchmark could be extended to measure improvements after targeted fine-tuning on Indic data.
  • Similar evaluations in other low-resource language families might show comparable safety transfer issues.
  • Users in Indic regions face uneven protection from model harms depending on the language used.

Load-bearing premise

The 6,000 prompts stay culturally accurate and relevant to safety after being translated into the 12 languages, and the refusal criteria truly reflect model behavior without translation errors or evaluator bias.

What would settle it

Re-running the evaluations with human-verified translations and independent safety labeling that finds consistent safety rates across languages would challenge the claim of significant drift.

Original abstract

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts, overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces IndicSafe, the first benchmark for evaluating LLM safety in 12 Indic languages using 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics. It evaluates 10 leading LLMs on translated prompt variants and reports significant safety drift, with cross-language agreement at only 12.8% and SAFE rate variance exceeding 17% across languages. The work concludes that safety alignment does not transfer evenly and releases the benchmark to support language-aware alignment strategies.

Significance. If the core methodology holds, this benchmark is significant for addressing an important gap in multilingual LLM safety for low-resource languages spoken by over 1.2 billion people. The release of IndicSafe, combined with metrics such as prompt-level entropy, category bias scores, and multilingual consistency indices, provides a concrete resource for future work on regional harms and could influence alignment practices in Indic deployments.

major comments (2)
  1. [Abstract] Abstract: The central claim of significant safety drift (12.8% cross-language agreement and >17% SAFE rate variance) and the conclusion that safety alignment does not transfer evenly are load-bearing for the paper. These rest on the assumption that the 6,000 prompts retain identical safety-relevant intent after translation into 12 languages, yet the abstract provides no details on translation validation, back-translation accuracy, native-speaker fidelity scores, prompt selection criteria, or inter-annotator agreement for safety labels.
  2. [Evaluation] Evaluation section: The description of refusal and safety labeling criteria lacks specificity on how consistency is ensured across languages and scripts; without this, it is unclear whether the reported over-refusal in low-resource scripts and failures to flag unsafe content reflect model behavior or labeling/translation artifacts.
minor comments (2)
  1. The abstract mentions but does not define prompt-level entropy, category bias scores, and multilingual consistency indices; add explicit definitions or equations in the methods or results section.
  2. Consider including confidence intervals or statistical tests alongside the reported variance and agreement figures to strengthen the quantitative claims.
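
One way to act on the second minor comment is a percentile bootstrap over prompts. The sketch below is illustrative only and is not the authors' procedure; the helper name `bootstrap_ci`, the resample count, and the toy indicator data (13 unanimous prompts out of 100) are assumptions.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_ci(values, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic`, resampling `values` with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-prompt indicators: 1 if all languages agreed on the label, else 0.
agreement_flags = [1] * 13 + [0] * 87
low, high = bootstrap_ci(agreement_flags, mean)
print(f"agreement = {mean(agreement_flags):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling at the prompt level treats prompts as the unit of analysis; if prompts are clustered by topic, resampling whole categories instead would be the more conservative choice.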

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. These highlight important areas for clarifying our methodology on translation and labeling, which we address point by point below. We have revised the manuscript to incorporate the suggested improvements.

  1. Referee: [Abstract] Abstract: The central claim of significant safety drift (12.8% cross-language agreement and >17% SAFE rate variance) and the conclusion that safety alignment does not transfer evenly are load-bearing for the paper. These rest on the assumption that the 6,000 prompts retain identical safety-relevant intent after translation into 12 languages, yet the abstract provides no details on translation validation, back-translation accuracy, native-speaker fidelity scores, prompt selection criteria, or inter-annotator agreement for safety labels.

    Authors: We agree that the abstract, due to length constraints, does not include these details. The full manuscript describes prompt selection criteria by native experts for cultural grounding in Section 3, along with the translation process and validation steps. Inter-annotator agreement for safety labels is reported in the appendix. We will revise the abstract to briefly reference the native-speaker translation validation and prompt criteria to better support the central claims. revision: yes

  2. Referee: [Evaluation] Evaluation section: The description of refusal and safety labeling criteria lacks specificity on how consistency is ensured across languages and scripts; without this, it is unclear whether the reported over-refusal in low-resource scripts and failures to flag unsafe content reflect model behavior or labeling/translation artifacts.

    Authors: We acknowledge this observation and agree that greater specificity on cross-language consistency would strengthen the section. The manuscript outlines refusal detection via standardized criteria and human review for a subset of cases. We will revise the Evaluation section to add explicit details on language-specific handling, script considerations, and steps taken to check for translation artifacts, such as parallel reviews of prompt variants. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct measurements

This is a standard empirical benchmark paper that constructs a dataset of 6,000 culturally grounded prompts, translates them into 12 Indic languages, evaluates 10 LLMs for safety refusals, and reports observable statistics such as 12.8% cross-language agreement and >17% SAFE rate variance. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on direct model outputs and prompt-level metrics rather than any reduction to inputs by construction. The study is self-contained against external benchmarks because the reported agreement and variance figures are computed from fresh evaluations without self-referential definitions or load-bearing prior results from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Benchmark creation relies on standard assumptions about prompt-based safety evaluation and translation fidelity; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Safety of LLM outputs can be reliably measured by refusal rates and category-specific flagging on translated prompts.
    Central to the evaluation protocol described in the abstract.
  • domain assumption: Translations of English safety prompts preserve cultural meaning and intent in Indic languages.
    Required for the cross-language comparison to be valid.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

    cs.CL · 2026-06 · unverdicted · novelty 7.0

    Introduces Indi-RomCoM benchmark for evaluating LLMs on Romanized code-mixed Indic-English instructions across seven tasks, four languages, and three mixing levels.

  2. Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

    cs.AI · 2026-06 · unverdicted · novelty 5.0

    LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.