BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture
Pith reviewed 2026-05-18 01:51 UTC · model grok-4.3
The pith
Large language models display inconsistent moral reasoning when evaluated on Bengali cultural contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce BengaliMoralBench, which spans five moral domains, each with ten subtopics, for a total of fifty culturally grounded categories. Scenarios are annotated by native-speaker consensus under three ethical lenses. Zero-shot evaluations across open-weight and closed-source models, including recent variants, show substantial variation in performance across lenses and domains, along with persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.
What carries the argument
BengaliMoralBench, the large-scale ethics benchmark with native-speaker consensus annotations under virtue, commonsense, and justice ethics for evaluating LLMs in Bengali contexts.
If this is right
- Current LLMs may generate responses that conflict with Bengali moral norms in areas like family and religion.
- The benchmark can be used to guide improvements in model training for better cultural sensitivity.
- Deployment of LLMs in Bangladesh and similar markets requires careful ethical auditing to mitigate risks.
- Multilingual models need targeted enhancements beyond general language capabilities to address moral fairness issues.
Where Pith is reading between the lines
- Extending this approach to other low-resource languages could create a network of culturally specific ethics benchmarks.
- Improved performance on BengaliMoralBench might correlate with better real-world utility in Bengali-language applications.
- Future work could involve dynamic scenarios that evolve with cultural changes rather than fixed annotations.
Load-bearing premise
Native-speaker consensus annotations under the three ethical lenses accurately and representatively capture Bengali moral norms across the chosen 50 subtopics without significant cultural bias or coverage gaps.
What would settle it
A replication study showing that independent groups of native Bengali speakers produce significantly different annotations from the original consensus, or all models achieving uniformly high scores without variation across lenses.
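One way such a replication could be scored is with Cohen's kappa between the original consensus labels and the labels from an independent annotator group. The sketch below is illustrative only: the binary label encoding and the toy label lists are assumptions, not data from the paper, and the paper may use a different agreement statistic.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both sources used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (1 = morally acceptable, 0 = not acceptable).
original_consensus = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
replication_group = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

kappa = cohen_kappa(original_consensus, replication_group)
print(f"kappa = {kappa:.2f}")
```

A kappa well below the agreement reported for the original annotators would count as evidence for the first condition above. What threshold counts as "significantly different" is a judgment the replication would have to justify.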
Original abstract
As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, spoken by over 285 million people worldwide and among the most widely spoken languages globally, remains underexplored. Existing ethics benchmarks are predominantly English-centric and shaped by Western moral frameworks, overlooking cultural nuances vital for real-world deployment. To address this gap, we introduce BengaliMoralBench, a large-scale ethics benchmark designed for Bengali language and sociocultural contexts. Our benchmark spans five moral domains: (1) Daily Activities, (2) Habits, (3) Parenting, (4) Family Relationships, and (5) Religious Activities, each subdivided into ten culturally grounded categories, totaling 50 subtopics. Each scenario is annotated through native-speaker consensus under three ethical lenses: virtue ethics, commonsense ethics, and justice ethics. We conduct a systematic zero-shot evaluation under a unified prompting protocol across both open-weight and closed-source models, including recent Llama and Gemma variants, Qwen and DeepSeek models, frontier models (GPT-4o-mini and Gemini 1.5 Pro), and a large multilingual baseline (Qwen3-Next-80B). Results show substantial variation in performance across lenses and domains, and our qualitative analysis reveals persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. These findings expose critical limitations of current LLMs in non-Western settings and underscore the need for culturally grounded evaluation. BengaliMoralBench provides a foundation for responsible localization and benchmarking to support the deployment of language technologies in culturally diverse, low-resource markets such as Bangladesh.
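The abstract's "unified prompting protocol" can be pictured as a single template applied unchanged to every scenario and lens. The following is a minimal sketch under stated assumptions: the template wording, the `acceptable`/`unacceptable` answer format, and the helper names `build_prompt` and `parse_label` are all hypothetical and are not taken from the paper.

```python
# The three ethical lenses named in the abstract.
LENSES = ("virtue", "commonsense", "justice")

# Hypothetical template: the paper's actual prompt text is not given here.
PROMPT_TEMPLATE = (
    "You are given a scenario written in Bengali.\n"
    "Judge it under {lens} ethics.\n"
    "Scenario: {scenario}\n"
    "Answer with exactly one word: acceptable or unacceptable."
)


def build_prompt(scenario, lens):
    """Fill the shared zero-shot template for one scenario and one lens."""
    if lens not in LENSES:
        raise ValueError(f"unknown lens: {lens!r}")
    return PROMPT_TEMPLATE.format(lens=lens, scenario=scenario)


def parse_label(reply):
    """Map a model reply to 1 (acceptable), 0 (unacceptable), or None."""
    text = reply.strip().lower()
    if text.startswith("unacceptable"):
        return 0
    if text.startswith("acceptable"):
        return 1
    return None  # unparseable replies are left for manual review


# Placeholder scenario text; real benchmark items would be Bengali sentences.
scenario = "<Bengali scenario text>"
prompts = {lens: build_prompt(scenario, lens) for lens in LENSES}
```

Holding the template fixed across models and lenses is what makes the resulting scores comparable. Returning `None` for unparseable replies, rather than counting them as wrong, is a design choice an evaluator would need to report.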
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BengaliMoralBench, a benchmark for auditing moral reasoning in LLMs within Bengali language and culture. It spans five domains (Daily Activities, Habits, Parenting, Family Relationships, Religious Activities) with ten culturally grounded subtopics each (50 total). Scenarios are annotated via native-speaker consensus under three ethical lenses (virtue ethics, commonsense ethics, justice ethics). The authors conduct zero-shot evaluations across open-weight models (Llama, Gemma, Qwen, DeepSeek variants), closed-source models (GPT-4o-mini, Gemini 1.5 Pro), and a multilingual baseline using a unified prompting protocol, reporting substantial performance variation across lenses and domains plus qualitative evidence of persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.
Significance. If the annotations reliably capture Bengali sociocultural norms, the work would be significant for filling the gap in non-Western, non-English ethics benchmarks for LLMs. The independent construction of the dataset, systematic zero-shot evaluation across diverse models, and unified prompting protocol provide a reproducible foundation for culturally grounded assessment in low-resource South Asian contexts. This could support responsible localization of language technologies.
major comments (2)
- [§3] §3 (Benchmark Construction and Annotation): The native-speaker consensus annotation process is described at a high level but provides no inter-annotator agreement statistics, number of annotators or scenarios, demographic details on the annotator pool (e.g., regional, religious, or socioeconomic diversity across the 285M+ Bengali-speaking population in Bangladesh and India), or exclusion criteria. This is load-bearing for the central claims, as the interpretation of LLM performance deviations as evidence of weaknesses in cultural grounding, commonsense reasoning, and moral fairness assumes the labels accurately and representatively reflect broad Bengali moral norms under the three lenses.
- [§5] §5 (Results and Qualitative Analysis): The reported 'substantial variation' and 'persistent weaknesses' rest on model outputs compared to the consensus labels, yet the section does not include quantitative breakdowns (e.g., per-lens accuracy tables tied to specific subtopics) or error analysis that would allow readers to assess whether deviations stem from model deficiencies versus potential annotation gaps or biases.
minor comments (3)
- [Abstract] Abstract: The description of the benchmark as 'large-scale' would be improved by stating the exact total number of scenarios or annotations to better contextualize scope and enable quick comparison with prior ethics benchmarks.
- [Related Work] Related Work: The discussion of English-centric benchmarks could benefit from additional citations to recent multilingual or South Asian ethics evaluation efforts for fuller positioning.
- [Tables and Figures] Tables/Figures: Ensure performance comparison tables include clear column labels for each ethical lens and model variant to improve readability of the variation results.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review of our manuscript on BengaliMoralBench. We address each major comment below and outline the revisions we will make to improve clarity and rigor.
Point-by-point responses
-
Referee: [§3] §3 (Benchmark Construction and Annotation): The native-speaker consensus annotation process is described at a high level but provides no inter-annotator agreement statistics, number of annotators or scenarios, demographic details on the annotator pool (e.g., regional, religious, or socioeconomic diversity across the 285M+ Bengali-speaking population in Bangladesh and India), or exclusion criteria. This is load-bearing for the central claims, as the interpretation of LLM performance deviations as evidence of weaknesses in cultural grounding, commonsense reasoning, and moral fairness assumes the labels accurately and representatively reflect broad Bengali moral norms under the three lenses.
Authors: We agree that the current description of the annotation process in §3 is insufficiently detailed to fully support our interpretive claims. In the revised manuscript we will expand this section to report inter-annotator agreement statistics, the precise number of annotators and scenarios, demographic characteristics of the annotator pool (including regional, religious, and socioeconomic diversity), and the exclusion criteria employed. These additions will increase transparency and strengthen the evidential basis for treating the consensus labels as representative of Bengali moral norms. revision: yes
-
Referee: [§5] §5 (Results and Qualitative Analysis): The reported 'substantial variation' and 'persistent weaknesses' rest on model outputs compared to the consensus labels, yet the section does not include quantitative breakdowns (e.g., per-lens accuracy tables tied to specific subtopics) or error analysis that would allow readers to assess whether deviations stem from model deficiencies versus potential annotation gaps or biases.
Authors: We concur that more granular quantitative reporting and error analysis are needed to allow readers to evaluate the sources of observed deviations. In the revised §5 we will add per-lens accuracy tables disaggregated by subtopic, together with a structured error analysis that categorizes model outputs and discusses the relative contributions of model limitations versus possible annotation variability or bias. This will make the evidence for our conclusions more transparent and interpretable. revision: yes
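The per-lens, per-subtopic breakdown promised in this response could be produced with a simple grouped accuracy computation. The sketch below is only an illustration: the record schema (`lens`, `subtopic`, `gold`, `pred`) and the toy records are assumptions, and the predictions are invented rather than results from the paper. The subtopic names are borrowed from the benchmark taxonomy quoted further down this page.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Accuracy of `pred` against `gold`, grouped by the given record fields."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group] += 1
        correct[group] += int(record["gold"] == record["pred"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy records: `gold` is the consensus label, `pred` the model output.
records = [
    {"lens": "virtue", "subtopic": "Shoes indoors", "gold": 1, "pred": 0},
    {"lens": "virtue", "subtopic": "Shoes indoors", "gold": 1, "pred": 1},
    {"lens": "justice", "subtopic": "Dowry", "gold": 0, "pred": 0},
    {"lens": "justice", "subtopic": "Dowry", "gold": 0, "pred": 0},
    {"lens": "commonsense", "subtopic": "Rickshaw/CNG commute", "gold": 1, "pred": 0},
]

per_lens = accuracy_by_group(records, ["lens"])
per_cell = accuracy_by_group(records, ["lens", "subtopic"])

for (lens, subtopic), acc in sorted(per_cell.items()):
    print(f"{lens:<12} {subtopic:<22} {acc:.2f}")
```

Cells with low accuracy and low annotator agreement would point toward annotation variability, while cells with low accuracy and high agreement would point toward model limitations. That distinction is what the referee asks the authors to make visible.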
Circularity Check
No circularity: independent benchmark with direct evaluation
full rationale
The paper constructs BengaliMoralBench as a new dataset across five domains and 50 subtopics, with native-speaker consensus annotations under virtue, commonsense, and justice lenses. It then performs zero-shot LLM evaluations using a unified prompting protocol. No equations, fitted parameters, self-citations, or ansatzes are invoked in a load-bearing manner. Results on model weaknesses derive directly from comparison against the created annotations rather than reducing to prior inputs or definitions by construction. This is a standard benchmark paper with self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Native-speaker consensus under virtue, commonsense, and justice ethics provides accurate ground truth for Bengali moral reasoning.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
BengaliMoralBench... 3,000 human-annotated scenarios spanning... triadic ethical framework of Justice, Virtue, and Commonsense... zero-shot evaluation... accuracy ranging broadly from 60% to over 90%
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
native-speaker consensus annotations under the three ethical lenses
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2505.21092. Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell. Randomness, not representation: The unreliability of evaluating cultural alignment in llms. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, page 2151–2165, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9...
-
[2]
evaluates LLM alignment with fundamental ethical values in various interaction scenarios, while ForbiddenQuestions (Shen et al., 2024) tests adherence to predefined ethical guidelines by assessing refusal to generate unsafe content. Additionally, the LLM Ethics Benchmark (Jiao et al., 2025) provides a systematic framework for evaluating moral reasoning, q...
-
[3]
offers a large-scale morality benchmark referencing normative theories like Justice, Utilitarianism, Deontology, Virtue Ethics, and Commonsense Morality, containing over 130,000 examples. Table 5: Comparative Analysis of Existing Ethical AI Benchmarks. Columns: Benchmark Name | Primary Focus | Key Design Principles | Evaluation Methodologies | Noted Biases/Limitations. Truthfu...
-
[4]
Bazar run, 2. Rickshaw/CNG commute, 3. Sharing tiffin, 4. Queueing at offices, 5. Neighbour favours, 6. Tea stall chats, 7. Load-shedding etiquette, 8. Wedding invites, 9. Cyclone prep, 10. Digital payments. Habits 1. Right/left-hand use, 2. Shoes indoors, 3. Saving water/electricity, 4. Honorifics, 5. Greeting elders, 6. Spitting/littering, 7. Rainwater ...
-
[5]
Joint vs nuclear living, 2. Supporting elders, 3. Dowry, 4. Cousin guardianship, 5. Inheritance division, 6. Sibling weddings, 7. Land disputes, 8. Disabled care, 9. Interfaith love, 10. Festival gifts. Religious Activities
-
[6]
Workplace salat, 2. Iftar with non-Muslims, 3. Friday closures, 4. Zakat vs charity, 5. Qurbani distribution, 6. Puja respect, 7. Hijab in labs, 8. Ramadan music, 9. Halal loans,
-
[7]
Aqeeqah animal choice. indicators such as valo (“good”) or kharap (“bad”). Following the pilot phase and subsequent guideline revisions, inter-annotator agreement significantly improved, rising from κ = 0.61 to κ = 0.87. B.3. Evaluation Metrics: To comprehensively assess model performance on BengaliMoralBench, we report four standard classification metrics across the ...
-
[8]
Commonsense Failures in Daily Activities: In scenarios such as Rickshaw Commute, models fail to recognize implicit virtuous choices, here, Sayem walking to school for his parents’ Ramadan shopping. The model misinterprets logistical action as morally neutral. This demonstrates a deficiency in context-aware commonsense reasoning and an inability to infer virt...
-
[9]
Justice Ethics Violations in Family Relationships: For statements involving structural inequality, such as Siblings’ Weddings, models incorrectly label clearly unethical gender-biased prioritization as acceptable. The model’s prediction aligns with prevalent societal norms rather than critiquing them, revealing sensitivity to statistical regularities in trai...
-
[10]
Virtue Ethics Misclassification in Habits: Acts of cultural etiquette, e.g., removing shoes before entering a relative’s house, are frequently dismissed as non-virtuous. The model undervalues subtle moral actions encoded in South Asian social norms, signaling a Western-centric bias in virtue encoding that prioritizes explicit moral acts over culturally nuan...
-
[11]
Cultural Misalignment in Parenting Decisions: In scenarios like School Choice, models fail to recognize moral pluralism in parental respect for children’s educational autonomy. The preference for Western moral schemas leads to misclassification of culturally salient virtues, demonstrating limited alignment with non-Western parenting philosophies
-
[12]
Shallow Reasoning in Religious Contexts: For acts of ritualized altruism, such as distributing Qurbani meat to the poor, models fail to capture faith-driven moral significance. These errors indicate poor contextual grounding in domain-specific religious ethics, with models treating deeply symbolic behaviors as mundane events
-
[13]
Surface-Level Pattern Reliance: Across multiple domains, models often rely on lexical or surface cues (e.g., “walked,” “school”) without considering context or intention. This leads to both false negatives in virtue recognition and false positives in ethical violations, reflecting an overreliance on statistical co-occurrence rather than semantic comprehension
-
[14]
Gender and Social Hierarchy Biases: Models reproduce embedded societal hierarchies in family and social scenarios. Mislabeling unethical prioritization of male family members illustrates systemic bias inherited from training corpora, affecting the fairness and moral alignment of predictions
-
[15]
Limited Cross-Domain Generalization: Even when models correctly recognize virtue in one domain (e.g., generosity), they fail in structurally similar contexts (e.g., respect for elders), indicating insufficient abstraction of moral principles across cultural and situational boundaries. E.2. Root Causes of Errors: From the qualitative revi...
-
[16]
Cultural Context Gap: Predominantly Western training data limits understanding of South Asian-specific moral codes, leading to misinterpretation of local virtues
-
[17]
Surface-Level Lexical Reliance: Models depend heavily on keywords rather than reasoning over intentions or outcomes, producing brittle ethical judgments
-
[18]
Lack of Religious and Ritual Awareness: Insufficient exposure to culturally embedded religious practices prevents accurate inference of ritualized ethical behavior
-
[19]
Social Bias Propagation: Prevalent societal hierarchies (gender, age, family roles) in training corpora bias model outputs, undermining justice-oriented reasoning
-
[20]
Limited Moral Abstraction Across Domains: Models struggle to generalize principles of virtue, justice, or altruism to contexts structurally different from those seen in training. E.3. Potential Solutions: To address these limitations, we propose: • Culturally Grounded Pretraining: Incorporate Bengali and South Asian ethical texts, folklore, and religious mat...