IndicGuard: A Multilingual Safety Guard Model and Dataset for Indic Languages

Adwait Borate; Parth Bramhecha; Raviraj Joshi; Sairaj Bodhale; Smit Deshmukh

arxiv: 2606.22841 · v1 · pith:LYDLZGICnew · submitted 2026-06-22 · 💻 cs.CL · cs.LG

IndicGuard: A Multilingual Safety Guard Model and Dataset for Indic Languages

Parth Bramhecha , Smit Deshmukh , Sairaj Bodhale , Adwait Borate , Raviraj Joshi This is my paper

Pith reviewed 2026-06-26 08:45 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords Indic languagessafety guard modelmultilingual moderationcultural harmsjailbreak detectionlow-resource languagesLLM alignmentcontent moderation

0 comments

The pith

IndicGuard is a safety guard model and dataset built for ten Indic languages to catch regional harms and jailbreaks that English models miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IndicGuard, a dataset and fine-tuned 4B-parameter model covering ten major Indic languages and designed to moderate content according to local socio-political sensitivities and adversarial prompts. It fine-tunes an instruction-tuned Gemma model on this corpus to act as a real-time guardrail for policy compliance. Evaluations show the resulting system improves robustness to localized vulnerabilities, maintains consistency across conversation turns, beats the CultureGuard baseline on the tested languages, and still works on low-resource Indic languages left out of training.

Core claim

IndicGuard, constructed from a high-volume culturally nuanced safety dataset for ten Indic languages that includes regional harms, sensitive socio-political contexts, and adversarial jailbreaks, yields a fine-tuned 4B model that enhances LLM robustness against localized vulnerabilities, delivers high moderation consistency across turns, outperforms CultureGuard, and generalizes to low-resource Indic languages excluded from training.

What carries the argument

IndicGuard, the 4B-parameter model fine-tuned from Gemma-3-4B-IT on the curated Indic safety dataset, used for real-time content moderation and policy compliance checking.

If this is right

LLMs equipped with IndicGuard will show higher moderation consistency across multiple conversational turns.
Safety performance will exceed that of CultureGuard on the evaluated Indic languages.
The same fine-tuning approach will extend protection to low-resource Indic languages never seen during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Guard models built the same way could be created for other language families that have distinct cultural harm categories.
Native-language safety checks could reduce the need to translate Indic content into English before moderation.
The results point to the value of language-specific datasets when aligning models for regions outside high-resource English data.

Load-bearing premise

The curated dataset systematically and accurately captures regional harms, sensitive socio-political contexts, and adversarial jailbreaks in a way that supports real-world generalization.

What would settle it

Run IndicGuard and CultureGuard on a held-out test set of real Indic-language prompts containing socio-political sensitivities or jailbreaks and measure whether IndicGuard shows no improvement in moderation accuracy or consistency.

Figures

Figures reproduced from arXiv: 2606.22841 by Adwait Borate, Parth Bramhecha, Raviraj Joshi, Sairaj Bodhale, Smit Deshmukh.

**Figure 1.** Figure 1: Data collection overview Listing 1: IndicGuard Dataset Schema Example { "id": " ae5bef96181e45c490ea69b585f11785 ", " prompt_label ": " unsafe ", " response_label ": " unsafe ", " violated_categories ": " Controlled / Regulated Substances ", " prompt_label_source ": " human ", " response_label_source ": " llm_jury ", " prompt ": " ...", " response ": " ...", " tag": " Culture_adaptive ", " language ": " Be… view at source ↗

**Figure 3.** Figure 3: End-to-end IndicGuard training pipeline. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 2.** Figure 2: Fine-tuning architecture for IndicGuard.LoRA [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 4.** Figure 4: Evaluation pipeline for IndicGuard. malformed generations, explicit in metric computation. 5.7 Software and Infrastructure All experiments were implemented using Python 3.11, PyTorch 2.6.0 (CUDA 12.4), HuggingFace Transformers 4.55.4, TRL 0.22.2, and Unsloth 2026.1.4. Mixed-precision training defaulted to float32 because of bfloat16 limitations on the Tesla T4 architecture. Dataset loading and preprocess… view at source ↗

**Figure 5.** Figure 5: Marathi Culture-Adaptive example from the [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

read the original abstract

As Large Language Models (LLMs) achieve widespread integration across diverse linguistic landscapes, ensuring their safety and alignment with regional normative values remains a critical challenge. Current safety mechanisms are predominantly optimized for English-centric frameworks, often failing to capture the unique socio-cultural sensitivities and localized categories of harm inherent to the Indic region. To address this gap, we introduce IndicGuard, a multilingual safety guard model and dataset for Indic languages. We construct a high-volume, culturally nuanced safety dataset encompassing ten major Indic languages, systematically curated to capture regional harms, sensitive socio-political contexts, and adversarial jailbreaks. Leveraging this corpus, we fine-tune a 4B-parameter instruction-tuned model based on Gemma-3-4B-IT to serve as a multilingual safety guardrail for real-time content moderation and policy compliance checking. Our empirical evaluations demonstrate that IndicGuard significantly enhances LLM robustness against localized vulnerabilities, achieving high moderation consistency across different conversational turns. Crucially, IndicGuard consistently outperforms the existing baseline model, CultureGuard, across evaluated languages. Finally, we demonstrate that our model effectively generalizes to low-resource Indic languages excluded from training, substantiating the structural robustness and cross-lingual transfer capabilities of the framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

IndicGuard ships a new Indic safety dataset and 4B guard model with claims of beating CultureGuard and generalizing to unseen languages, but the evals need close inspection.

read the letter

The paper introduces a high-volume safety dataset for ten Indic languages and fine-tunes a Gemma-3-4B-IT model into IndicGuard. The core move is curating data that includes regional harms, socio-political contexts, and jailbreaks, then showing the model beats the CultureGuard baseline while holding up on low-resource languages left out of training.

The dataset construction is the part that actually moves the needle. Most safety work stays English-first, so targeted Indic coverage fills a practical hole for a large user base. If the curation process is reproducible and the cross-lingual transfer numbers are solid, that result would be worth noting.

The soft spot remains the evaluation. The abstract states outperformance and consistency across turns but gives no metrics, splits, or error bars in the summary. The key assumption—that the dataset faithfully captures localized vulnerabilities—still sits on the methods and results sections. Without those details the generalization claim is hard to weigh.

This is for researchers working on multilingual safety or regional alignment. Anyone already using or extending CultureGuard-style guardrails would get direct value from the new data and model if released.

It deserves peer review because the topic is under-served and they produce concrete artifacts, even if the quantitative support needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IndicGuard, a multilingual safety guard model and dataset for Indic languages. It constructs a high-volume, culturally nuanced safety dataset for ten major Indic languages capturing regional harms, sensitive socio-political contexts, and adversarial jailbreaks; fine-tunes a 4B-parameter instruction-tuned model based on Gemma-3-4B-IT; and claims that the resulting guardrail enhances LLM robustness against localized vulnerabilities, achieves high moderation consistency, consistently outperforms the CultureGuard baseline across evaluated languages, and generalizes to low-resource Indic languages excluded from training.

Significance. If the empirical results hold with proper controls, this would address a meaningful gap in non-English LLM safety for Indic languages, which have distinct socio-cultural sensitivities not covered by English-centric frameworks. The cross-lingual generalization claim to unseen low-resource languages, if substantiated, would be a notable strength for extending safety mechanisms beyond high-resource settings.

major comments (2)

[Abstract and §5] Abstract and §5 (Experiments): the central claims of outperformance over CultureGuard and generalization to excluded languages are asserted without any reported metrics, baselines, data splits, error bars, or evaluation protocol details; this is load-bearing for the empirical contribution and prevents assessment of whether the results support the headline conclusions.
[§3] §3 (Dataset Construction): the description of how the corpus systematically captures regional harms, socio-political contexts, and adversarial jailbreaks lacks quantitative details on coverage, annotation guidelines, inter-annotator agreement, or validation steps; this directly underpins the weakest assumption regarding real-world generalization and cross-lingual transfer.

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta vs. baseline) to ground the claims of outperformance.
[§4] Notation for the ten Indic languages and the exact fine-tuning hyperparameters should be listed explicitly in a table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below and will revise the manuscript to provide the requested details.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Experiments): the central claims of outperformance over CultureGuard and generalization to excluded languages are asserted without any reported metrics, baselines, data splits, error bars, or evaluation protocol details; this is load-bearing for the empirical contribution and prevents assessment of whether the results support the headline conclusions.

Authors: We agree that the abstract and §5 currently lack the quantitative details needed to substantiate the claims. In the revised version we will add explicit performance metrics (e.g., F1, accuracy) for IndicGuard versus CultureGuard across all evaluated languages, train/validation/test split sizes, the precise evaluation protocol, and error bars or confidence intervals from multiple runs. These additions will allow readers to assess the strength of the outperformance and generalization results. revision: yes
Referee: [§3] §3 (Dataset Construction): the description of how the corpus systematically captures regional harms, socio-political contexts, and adversarial jailbreaks lacks quantitative details on coverage, annotation guidelines, inter-annotator agreement, or validation steps; this directly underpins the weakest assumption regarding real-world generalization and cross-lingual transfer.

Authors: We acknowledge that §3 would be strengthened by quantitative and procedural details. We will expand the section to report dataset coverage statistics (instances per language and harm category), the complete annotation guidelines, inter-annotator agreement scores, and the validation procedures used. This will better document how regional harms and contexts were captured and support claims about generalization. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript describes an empirical pipeline of dataset curation across ten Indic languages followed by fine-tuning of Gemma-3-4B-IT and standard hold-out evaluation; no equations, derivations, or parameter-fitting steps are present that could reduce a claimed prediction to its own inputs by construction. All performance claims rest on external test sets and cross-lingual generalization experiments rather than self-referential definitions or self-citation chains. The work is therefore self-contained as a conventional ML engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger records the key unverified premises stated in the abstract. No free parameters, invented entities, or additional axioms can be extracted beyond the domain assumption about dataset quality.

axioms (1)

domain assumption The constructed dataset captures regional harms, sensitive socio-political contexts, and adversarial jailbreaks in a culturally accurate and comprehensive way.
This premise is required for the claimed robustness and generalization benefits to follow from the fine-tuning step described in the abstract.

pith-pipeline@v0.9.1-grok · 5753 in / 1259 out tokens · 25193 ms · 2026-06-26T08:45:53.420402+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 1 canonical work pages

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Shivdutt Singh, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. 2024. Towards measuring and modeling “culture” in llms: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15763--15784

2024
[10]

Masoomali Fatehkia, Enes Altinisik, and Husrev Taha Sencar. 2026. Fanarguard: A culturally-aware moderation filter for arabic language models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7848--7869

2026
[11]

Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. 2025. Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational L...

2025
[12]

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. Advances in neural information processing systems, 37:8093--8131

2024
[14]

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36:24678--24704

2023
[16]

Raviraj Bhuminand Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Margaret Peters Long, Sanjay Singh Chauhan, and 1 others. 2025. Cultureguard: Towards culturally-aware dataset and guard model for multilingual safety applications. In Proceedings of the 14th International Joint Conferen...

2025
[17]

Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. 2025. Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art. Transactions of the Association for Computational Linguistics, 13:652--689

2025
[18]

Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 15009--15018

2023
[20]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

2022
[21]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728--53741

2023
[22]

Paul R \"o ttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Lo...

2024
[23]

Kush R Varshney. 2024. Decolonial ai alignment: Openness, visesa-dharma, and including excluded knowledges. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 1467--1481

2024
[26]

0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails , author=

Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[27]

Cultureguard: Towards culturally-aware dataset and guard model for multilingual safety applications , author=. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics , pages=
[28]

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[29]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Towards measuring and modeling “culture” in LLMs: A survey , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[30]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=
[31]

arXiv preprint arXiv:2204.05862 , year=

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

Pith/arXiv arXiv
[32]

arXiv preprint arXiv:2212.08073 , year=

Constitutional ai: Harmlessness from ai feedback , author=. arXiv preprint arXiv:2212.08073 , year=

Pith/arXiv arXiv
[33]

Advances in neural information processing systems , volume=

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms , author=. Advances in neural information processing systems , volume=
[34]

Advances in Neural Information Processing Systems , volume=

Beavertails: Towards improved safety alignment of llm via a human-preference dataset , author=. Advances in Neural Information Processing Systems , volume=
[35]

arXiv preprint arXiv:2312.06674 , year=

Llama guard: Llm-based input-output safeguard for human-ai conversations , author=. arXiv preprint arXiv:2312.06674 , year=

Pith/arXiv arXiv
[36]

Proceedings of the AAAI conference on artificial intelligence , volume=

A holistic approach to undesired content detection in the real world , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[37]

arXiv preprint arXiv:2402.04249 , year=

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal , author=. arXiv preprint arXiv:2402.04249 , year=

Pith/arXiv arXiv
[38]

arXiv preprint arXiv:2407.21772 , year=

Shieldgemma: Generative ai content moderation based on gemma , author=. arXiv preprint arXiv:2407.21772 , year=

Pith/arXiv arXiv
[39]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
[40]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Xstest: A test suite for identifying exaggerated safety behaviours in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024
[41]

Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations , pages=

Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails , author=. Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations , pages=

2023
[42]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

CantTalkAboutThis: Aligning language models to stay on topic in dialogues , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024
[43]

Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

Decolonial AI alignment: Openness, visesa-dharma, and including excluded knowledges , author=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=
[44]

Transactions of the Association for Computational Linguistics , volume=

Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art , author=. Transactions of the Association for Computational Linguistics , volume=. 2025 , publisher=

2025
[45]

arXiv preprint arXiv:2308.16149 , year=

Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models , author=. arXiv preprint arXiv:2308.16149 , year=

arXiv
[46]

arXiv preprint arXiv:2307.15043 , year=

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv
[47]

arXiv preprint arXiv:2205.14728 , year=

L3cube-mahanlp: Marathi natural language processing datasets, models, and library , author=. arXiv preprint arXiv:2205.14728 , year=

arXiv

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[2] [2]

Publications Manual , year = "1983", publisher =

1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[5] [5]

Dan Gusfield , title =. 1997

1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[8] [8]

Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Shivdutt Singh, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. 2024. Towards measuring and modeling “culture” in llms: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15763--15784

2024

[9] [10]

Masoomali Fatehkia, Enes Altinisik, and Husrev Taha Sencar. 2026. Fanarguard: A culturally-aware moderation filter for arabic language models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7848--7869

2026

[10] [11]

Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. 2025. Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational L...

2025

[11] [12]

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. Advances in neural information processing systems, 37:8093--8131

2024

[12] [14]

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36:24678--24704

2023

[13] [16]

Raviraj Bhuminand Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Margaret Peters Long, Sanjay Singh Chauhan, and 1 others. 2025. Cultureguard: Towards culturally-aware dataset and guard model for multilingual safety applications. In Proceedings of the 14th International Joint Conferen...

2025

[14] [17]

Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. 2025. Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art. Transactions of the Association for Computational Linguistics, 13:652--689

2025

[15] [18]

Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 15009--15018

2023

[16] [20]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

2022

[17] [21]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728--53741

2023

[18] [22]

Paul R \"o ttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Lo...

2024

[19] [23]

Kush R Varshney. 2024. Decolonial ai alignment: Openness, visesa-dharma, and including excluded knowledges. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 1467--1481

2024

[20] [26]

0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails , author=

Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025

[21] [27]

Cultureguard: Towards culturally-aware dataset and guard model for multilingual safety applications , author=. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics , pages=

[22] [28]

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[23] [29]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Towards measuring and modeling “culture” in LLMs: A survey , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024

[24] [30]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

[25] [31]

arXiv preprint arXiv:2204.05862 , year=

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

Pith/arXiv arXiv

[26] [32]

arXiv preprint arXiv:2212.08073 , year=

Constitutional ai: Harmlessness from ai feedback , author=. arXiv preprint arXiv:2212.08073 , year=

Pith/arXiv arXiv

[27] [33]

Advances in neural information processing systems , volume=

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms , author=. Advances in neural information processing systems , volume=

[28] [34]

Advances in Neural Information Processing Systems , volume=

Beavertails: Towards improved safety alignment of llm via a human-preference dataset , author=. Advances in Neural Information Processing Systems , volume=

[29] [35]

arXiv preprint arXiv:2312.06674 , year=

Llama guard: Llm-based input-output safeguard for human-ai conversations , author=. arXiv preprint arXiv:2312.06674 , year=

Pith/arXiv arXiv

[30] [36]

Proceedings of the AAAI conference on artificial intelligence , volume=

A holistic approach to undesired content detection in the real world , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[31] [37]

arXiv preprint arXiv:2402.04249 , year=

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal , author=. arXiv preprint arXiv:2402.04249 , year=

Pith/arXiv arXiv

[32] [38]

arXiv preprint arXiv:2407.21772 , year=

Shieldgemma: Generative ai content moderation based on gemma , author=. arXiv preprint arXiv:2407.21772 , year=

Pith/arXiv arXiv

[33] [39]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

[34] [40]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Xstest: A test suite for identifying exaggerated safety behaviours in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024

[35] [41]

Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations , pages=

Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails , author=. Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations , pages=

2023

[36] [42]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

CantTalkAboutThis: Aligning language models to stay on topic in dialogues , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024

[37] [43]

Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

Decolonial AI alignment: Openness, visesa-dharma, and including excluded knowledges , author=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

[38] [44]

Transactions of the Association for Computational Linguistics , volume=

Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art , author=. Transactions of the Association for Computational Linguistics , volume=. 2025 , publisher=

2025

[39] [45]

arXiv preprint arXiv:2308.16149 , year=

Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models , author=. arXiv preprint arXiv:2308.16149 , year=

arXiv

[40] [46]

arXiv preprint arXiv:2307.15043 , year=

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv

[41] [47]

arXiv preprint arXiv:2205.14728 , year=

L3cube-mahanlp: Marathi natural language processing datasets, models, and library , author=. arXiv preprint arXiv:2205.14728 , year=

arXiv