
arxiv: 2511.03180 · v2 · submitted 2025-11-05 · 💻 cs.CL

BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

Pith reviewed 2026-05-18 01:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords moral reasoning · ethics benchmark · Bengali language · large language models · cultural alignment · multilingual AI · zero-shot evaluation · moral fairness

The pith

Large language models display inconsistent moral reasoning when evaluated on Bengali cultural contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BengaliMoralBench, a benchmark designed to assess moral reasoning in LLMs specifically for Bengali language and sociocultural settings. It covers five domains including daily activities, parenting, and religious activities, with scenarios annotated by native speakers using virtue ethics, commonsense ethics, and justice ethics. Evaluations of various models reveal substantial differences in performance and highlight weaknesses in handling cultural nuances and fairness. This matters because LLMs are increasingly used in regions where alignment with local ethics is essential to prevent misaligned outputs.

Core claim

We introduce BengaliMoralBench spanning five moral domains each with ten subtopics for a total of fifty culturally grounded categories. Scenarios receive annotations from native-speaker consensus under three ethical lenses, and zero-shot evaluations across open-weight and closed-source models including recent variants show substantial variation in performance across lenses and domains along with persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.

What carries the argument

BengaliMoralBench, the large-scale ethics benchmark with native-speaker consensus annotations under virtue, commonsense, and justice ethics for evaluating LLMs in Bengali contexts.

If this is right

  • Current LLMs may generate responses that conflict with Bengali moral norms in areas like family and religion.
  • The benchmark can be used to guide improvements in model training for better cultural sensitivity.
  • Deployment of LLMs in Bangladesh and similar markets requires careful ethical auditing to mitigate risks.
  • Multilingual models need targeted enhancements beyond general language capabilities to address moral fairness issues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this approach to other low-resource languages could create a network of culturally specific ethics benchmarks.
  • Improved performance on BengaliMoralBench might correlate with better real-world utility in Bengali-language applications.
  • Future work could involve dynamic scenarios that evolve with cultural changes rather than fixed annotations.

Load-bearing premise

Native-speaker consensus annotations under the three ethical lenses accurately and representatively capture Bengali moral norms across the chosen 50 subtopics without significant cultural bias or coverage gaps.

What would settle it

A replication study showing that independent groups of native Bengali speakers produce significantly different annotations from the original consensus, or all models achieving uniformly high scores without variation across lenses.
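
One way to quantify how far an independent annotator group diverges from the original consensus is Cohen's kappa over the two label sets. The sketch below is a minimal, hypothetical illustration: the function name and the toy labels are assumptions made for this example and do not come from the paper or its data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both groups agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two groups' marginal label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both groups used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (1 = ethical, 0 = unethical) for eight scenarios.
original_consensus = [1, 1, 0, 0, 1, 0, 1, 0]
replication_group = [1, 1, 0, 1, 1, 0, 0, 0]
print(cohen_kappa(original_consensus, replication_group))
```

A kappa well below the agreement level reported for the original annotation would point toward the first outcome described above; what threshold counts as "significantly different" is a judgment the replication study would need to justify.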

Figures

Figures reproduced from arXiv: 2511.03180 by Azmine Toushik Wasi, Dong-Kyu Chae, Koushik Ahamed Tonmoy, Shahriyar Zaman Ridoy, Taki Hasan Rafi.

Figure 1: Overview of the BengaliMoralBench Benchmark. (a) Illustrates examples from the Virtue, Justice, and Commonsense ethical frameworks, each presenting paired Bengali-English ethical and unethical behavioral scenarios grounded in cultural context. (b) Shows the domain-wise subtopic distribution structured across five major life domains: Family Relationships, Habits, Parenting, Religious Activities, and Daily A…
Figure 2: Overview of the BengaliMoralBench pipeline. (a) Benchmark: Native annotators wrote culturally grounded moral scenarios, refined through a pilot phase and multi-stage validation. (b) Evaluation: LLMs classify behaviors as Ethical or Unethical based on the chosen ethics type.
    … aligned with each ethical lens; and (iv) an in-depth analysis revealing consistent failures in cultural grounding, commonsense reasoni…
Figure 3: Average LLM performance (Accuracy and F1) across tasks.
Figure 5: Relation between model parameters and evaluation metrics.
    … performance across LLM families, sizes, and tasks. Gemma 2 (9B) leads Commonsense and Virtue (Acc: 91.20%/89.70%; MCC: 0.8242/0.7947; 𝜅: 0.8240/0.7940), while Qwen 2.5 (14B) is best on Justice (Acc: 86.29%; F1: 86.15; MCC: 0.7391; 𝜅: 0.7255), indicating that data mix and instruction strategy matter more than parameter count. Llama 3.3 (70B) is competit…
Figure 6: Relation between model parameters and evaluation metrics across different tasks.
Figure 7: Error taxonomy, root causes, and remedies on BengaliMoralBench. (a) Errors: Recurrent failures across Commonsense, Justice, and Virtue scenarios (e.g., daily activities, family relationships, habits, parenting decisions). (b) Root causes: Cultural–context gaps, surface-level lexical reliance, limited religious/ritual awareness, and propagation of social biases. (c) Solutions: Culturally grounded pretrainin…
Figure 8: Studying effect of context and personas.
Figure 9: Error analysis.
Figure 10: Zero-shot prompts used for evaluation in both Bangla and English. The models are asked to respond with "1" (ethical) or "0" (unethical).

As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, spoken by over 285 million people worldwide and among the most widely spoken languages globally, remains underexplored. Existing ethics benchmarks are predominantly English-centric and shaped by Western moral frameworks, overlooking cultural nuances vital for real-world deployment. To address this gap, we introduce BengaliMoralBench, a large-scale ethics benchmark designed for Bengali language and sociocultural contexts. Our benchmark spans five moral domains: (1) Daily Activities, (2) Habits, (3) Parenting, (4) Family Relationships, and (5) Religious Activities, each subdivided into ten culturally grounded categories, totaling 50 subtopics. Each scenario is annotated through native-speaker consensus under three ethical lenses: virtue ethics, commonsense ethics, and justice ethics. We conduct a systematic zero-shot evaluation under a unified prompting protocol across both open-weight and closed-source models, including recent Llama and Gemma variants, Qwen and DeepSeek models, frontier models (GPT-4o-mini and Gemini 1.5 Pro), and a large multilingual baseline (Qwen3-Next-80B). Results show substantial variation in performance across lenses and domains, and our qualitative analysis reveals persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. These findings expose critical limitations of current LLMs in non-Western settings and underscore the need for culturally grounded evaluation. BengaliMoralBench provides a foundation for responsible localization and benchmarking to support the deployment of language technologies in culturally diverse, low-resource markets such as Bangladesh.
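
The zero-shot protocol asks each model to answer "1" (ethical) or "0" (unethical), per the Figure 10 caption, and scores the answers against the consensus labels. The sketch below shows one plausible way to parse such replies and compute accuracy and the Matthews correlation coefficient (MCC). The function names, the first-character parsing rule, and the choice to count unparseable replies as errors are all assumptions for illustration, not the authors' evaluation code.

```python
import math


def parse_label(response):
    """Map a raw model reply to 1 (ethical), 0 (unethical), or None."""
    text = response.strip()
    if text.startswith("1"):
        return 1
    if text.startswith("0"):
        return 0
    return None  # the reply did not follow the requested format


def accuracy_and_mcc(gold, pred):
    """Return (accuracy, MCC) for binary labels; None predictions count as wrong."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    tp = tn = fp = fn = 0
    for g, p in zip(gold, pred):
        if p is None:
            p = 1 - g  # assumption: an unparseable reply is scored as an error
        if g == 1 and p == 1:
            tp += 1
        elif g == 0 and p == 0:
            tn += 1
        elif g == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(gold)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical consensus labels and raw model replies.
gold_labels = [1, 0, 1, 0]
replies = ["1", "0", " 0\n", "unsure"]
predictions = [parse_label(r) for r in replies]
print(accuracy_and_mcc(gold_labels, predictions))
```

How format violations are handled can move the reported numbers noticeably, so an evaluation would ideally state its rule explicitly; the one used here is only one option.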

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces BengaliMoralBench, a benchmark for auditing moral reasoning in LLMs within Bengali language and culture. It spans five domains (Daily Activities, Habits, Parenting, Family Relationships, Religious Activities) with ten culturally grounded subtopics each (50 total). Scenarios are annotated via native-speaker consensus under three ethical lenses (virtue ethics, commonsense ethics, justice ethics). The authors conduct zero-shot evaluations across open-weight models (Llama, Gemma, Qwen, DeepSeek variants), closed-source models (GPT-4o-mini, Gemini 1.5 Pro), and a multilingual baseline using a unified prompting protocol, reporting substantial performance variation across lenses and domains plus qualitative evidence of persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.

Significance. If the annotations reliably capture Bengali sociocultural norms, the work would be significant for filling the gap in non-Western, non-English ethics benchmarks for LLMs. The independent construction of the dataset, systematic zero-shot evaluation across diverse models, and unified prompting protocol provide a reproducible foundation for culturally grounded assessment in low-resource South Asian contexts. This could support responsible localization of language technologies.

major comments (2)
  1. [§3] §3 (Benchmark Construction and Annotation): The native-speaker consensus annotation process is described at a high level but provides no inter-annotator agreement statistics, number of annotators or scenarios, demographic details on the annotator pool (e.g., regional, religious, or socioeconomic diversity across the 285M+ Bengali-speaking population in Bangladesh and India), or exclusion criteria. This is load-bearing for the central claims, as the interpretation of LLM performance deviations as evidence of weaknesses in cultural grounding, commonsense reasoning, and moral fairness assumes the labels accurately and representatively reflect broad Bengali moral norms under the three lenses.
  2. [§5] §5 (Results and Qualitative Analysis): The reported 'substantial variation' and 'persistent weaknesses' rest on model outputs compared to the consensus labels, yet the section does not include quantitative breakdowns (e.g., per-lens accuracy tables tied to specific subtopics) or error analysis that would allow readers to assess whether deviations stem from model deficiencies versus potential annotation gaps or biases.
minor comments (3)
  1. [Abstract] Abstract: The description of the benchmark as 'large-scale' would be improved by stating the exact total number of scenarios or annotations to better contextualize scope and enable quick comparison with prior ethics benchmarks.
  2. [Related Work] Related Work: The discussion of English-centric benchmarks could benefit from additional citations to recent multilingual or South Asian ethics evaluation efforts for fuller positioning.
  3. [Tables and Figures] Tables/Figures: Ensure performance comparison tables include clear column labels for each ethical lens and model variant to improve readability of the variation results.
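
The per-lens, per-subtopic breakdown requested in major comment 2 could be produced with a simple grouping pass over the evaluation records. The sketch below is hypothetical: the record fields (`lens`, `subtopic`, `gold`, `pred`) and the toy rows are assumptions for illustration, and the subtopic names are borrowed from the benchmark's category list only as labels.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(lens, subtopic): accuracy} for an iterable of record dicts."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = (record["lens"], record["subtopic"])
        total[key] += 1
        correct[key] += int(record["gold"] == record["pred"])
    return {key: correct[key] / total[key] for key in total}


# Hypothetical evaluation records; the values are not real results.
records = [
    {"lens": "justice", "subtopic": "Dowry", "gold": 0, "pred": 0},
    {"lens": "justice", "subtopic": "Dowry", "gold": 0, "pred": 1},
    {"lens": "virtue", "subtopic": "Shoes indoors", "gold": 1, "pred": 1},
    {"lens": "virtue", "subtopic": "Shoes indoors", "gold": 1, "pred": 1},
]
for (lens, subtopic), acc in sorted(accuracy_by_group(records).items()):
    print(f"{lens:10s} {subtopic:15s} {acc:.2f}")
```

Cells with low accuracy across every model would be natural candidates for re-inspecting the annotations, which is the model-versus-label ambiguity the comment raises.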

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript on BengaliMoralBench. We address each major comment below and outline the revisions we will make to improve clarity and rigor.

  1. Referee: [§3] §3 (Benchmark Construction and Annotation): The native-speaker consensus annotation process is described at a high level but provides no inter-annotator agreement statistics, number of annotators or scenarios, demographic details on the annotator pool (e.g., regional, religious, or socioeconomic diversity across the 285M+ Bengali-speaking population in Bangladesh and India), or exclusion criteria. This is load-bearing for the central claims, as the interpretation of LLM performance deviations as evidence of weaknesses in cultural grounding, commonsense reasoning, and moral fairness assumes the labels accurately and representatively reflect broad Bengali moral norms under the three lenses.

    Authors: We agree that the current description of the annotation process in §3 is insufficiently detailed to fully support our interpretive claims. In the revised manuscript we will expand this section to report inter-annotator agreement statistics, the precise number of annotators and scenarios, demographic characteristics of the annotator pool (including regional, religious, and socioeconomic diversity), and the exclusion criteria employed. These additions will increase transparency and strengthen the evidential basis for treating the consensus labels as representative of Bengali moral norms. revision: yes

  2. Referee: [§5] §5 (Results and Qualitative Analysis): The reported 'substantial variation' and 'persistent weaknesses' rest on model outputs compared to the consensus labels, yet the section does not include quantitative breakdowns (e.g., per-lens accuracy tables tied to specific subtopics) or error analysis that would allow readers to assess whether deviations stem from model deficiencies versus potential annotation gaps or biases.

    Authors: We concur that more granular quantitative reporting and error analysis are needed to allow readers to evaluate the sources of observed deviations. In the revised §5 we will add per-lens accuracy tables disaggregated by subtopic, together with a structured error analysis that categorizes model outputs and discusses the relative contributions of model limitations versus possible annotation variability or bias. This will make the evidence for our conclusions more transparent and interpretable. revision: yes

Circularity Check

0 steps flagged

No circularity: independent benchmark with direct evaluation


The paper constructs BengaliMoralBench as a new dataset across five domains and 50 subtopics, with native-speaker consensus annotations under virtue, commonsense, and justice lenses. It then performs zero-shot LLM evaluations using a unified prompting protocol. No equations, fitted parameters, self-citations, or ansatzes are invoked in a load-bearing manner. Results on model weaknesses derive directly from comparison against the created annotations rather than reducing to prior inputs or definitions by construction. This is a standard benchmark paper with self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the assumption that native-speaker consensus yields reliable moral labels for Bengali contexts; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Native-speaker consensus under virtue, commonsense, and justice ethics provides accurate ground truth for Bengali moral reasoning.
    The benchmark construction and evaluation claims depend on this annotation process described in the abstract.

pith-pipeline@v0.9.0 · 5854 in / 1278 out tokens · 43877 ms · 2026-05-18T01:51:38.952818+00:00 · methodology



Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    do anything now

    URL https://arxiv.org/abs/2505.21092. Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell. Randomness, not representation: The unreliability of evaluating cultural alignment in LLMs. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, page 2151–2165, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9...

  2. [2]

    evaluates LLM alignment with fundamental ethical values in various interaction scenarios, while ForbiddenQuestions (Shen et al., 2024) tests adherence to predefined ethical guidelines by assessing refusal to generate unsafe content. Additionally, the LLM Ethics Benchmark (Jiao et al., 2025) provides a systematic framework for evaluating moral reasoning, q...

  3. [3]

    AI colonialism

    offers a large-scale morality benchmark referencing normative theories like Justice, Utilitarianism, Deontology, Virtue Ethics, and Commonsense Morality, containing over 130,000 examples. Table 5: Comparative Analysis of Existing Ethical AI Benchmarks. Columns: Benchmark Name | Primary Focus | Key Design Principles | Evaluation Methodologies | Noted Biases/Limitations. Truthfu...

  4. [4]

    Rickshaw/CNG commute, 3

    Bazar run, 2. Rickshaw/CNG commute, 3. Sharing tiffin, 4. Queueing at offices, 5. Neighbour favours, 6. Tea stall chats, 7. Load-shedding etiquette, 8. Wedding invites, 9. Cyclone prep, 10. Digital payments. Habits 1. Right/left-hand use, 2. Shoes indoors, 3. Saving water/electricity, 4. Honorifics, 5. Greeting elders, 6. Spitting/littering, 7. Rainwater ...

  5. [5]

    Supporting elders, 3

    Joint vs nuclear living, 2. Supporting elders, 3. Dowry, 4. Cousin guardianship, 5. Inheritance division, 6. Sibling weddings, 7. Land disputes, 8. Disabled care, 9. Interfaith love, 10. Festival gifts. Religious Activities

  6. [6]

    Iftar with non-Muslims, 3

    Workplace salat, 2. Iftar with non-Muslims, 3. Friday closures, 4. Zakat vs charity, 5. Qurbani distribution, 6. Puja respect, 7. Hijab in labs, 8. Ramadan music, 9. Halal loans,

  7. [7]

    good”) or kharap (“bad

    Aqeeqah animal choice. indicators such as valo (“good”) or kharap (“bad”). Following the pilot phase and subsequent guideline revisions, inter-annotator agreement significantly improved, rising from 𝜅 = 0.61 to 𝜅 = 0.87. B.3. Evaluation Metrics: To comprehensively assess model performance on BengaliMoralBench, we report four standard classification metrics across the ...

  8. [8]

    The model misinterprets logistical action as morally neutral

    Commonsense Failures in Daily Activities: In scenarios such as Rickshaw Commute, models fail to recognize implicit virtuous choices, here, Sayem walking to school for his parents’ Ramadan shopping. The model misinterprets logistical action as morally neutral. This demonstrates a deficiency in context-aware commonsense reasoning and an inability to infer virt...

  9. [9]

    Justice Ethics Violations in Family Relationships: For statements involving structural inequality, such as Siblings’ Weddings, models incorrectly label clearly unethical gender-biased prioritization as acceptable. The model’s prediction aligns with prevalent societal norms rather than critiquing them, revealing sensitivity to statistical regularities in trai...

  10. [10]

    Virtue Ethics Misclassification in Habits: Acts of cultural etiquette, e.g., removing shoes before entering a relative’s house, are frequently dismissed as non-virtuous. The model undervalues subtle moral actions encoded in South Asian social norms, signaling a Western-centric bias in virtue encoding that prioritizes explicit moral acts over culturally nuan...

  11. [11]

    The preference for Western moral schemas leads to misclassification of culturally salient virtues, demonstrating limited alignment with non-Western parenting philosophies

    Cultural Misalignment in Parenting Decisions: In scenarios like School Choice, models fail to recognize moral pluralism in parental respect for children’s educational autonomy. The preference for Western moral schemas leads to misclassification of culturally salient virtues, demonstrating limited alignment with non-Western parenting philosophies.

  12. [12]

    These errors indicate poor contextual grounding in domain-specific religious ethics, with models treating deeply symbolic behaviors as mundane events

    Shallow Reasoning in Religious Contexts: For acts of ritualized altruism, such as distributing Qurbani meat to the poor, models fail to capture faith-driven moral significance. These errors indicate poor contextual grounding in domain-specific religious ethics, with models treating deeply symbolic behaviors as mundane events.

  13. [13]

    walked,” “school

    Surface-Level Pattern Reliance: Across multiple domains, models often rely on lexical or surface cues (e.g., “walked,” “school”) without considering context or intention. This leads to both false negatives in virtue recognition and false positives in ethical violations, reflecting an overreliance on statistical co-occurrence rather than semantic comprehension.

  14. [14]

    Mislabeling unethical prioritization of male family members illustrates systemic bias inherited from training corpora, affecting the fairness and moral alignment of predictions

    Gender and Social Hierarchy Biases: Models reproduce embedded societal hierarchies in family and social scenarios. Mislabeling unethical prioritization of male family members illustrates systemic bias inherited from training corpora, affecting the fairness and moral alignment of predictions.

  15. [15]

    26 BengaliMoralBench E.2

    Limited Cross-Domain Generalization: Even when models correctly recognize virtue in one domain (e.g., generosity), they fail in structurally similar contexts (e.g., respect for elders), indicating insufficient abstraction of moral principles across cultural and situational boundaries. E.2. Root Causes of Errors: From the qualitative revi...

  16. [16]

    Cultural Context Gap: Predominantly Western training data limits understanding of South Asian-specific moral codes, leading to misinterpretation of local virtues.

  17. [17]

    Surface-Level Lexical Reliance: Models depend heavily on keywords rather than reasoning over intentions or outcomes, producing brittle ethical judgments.

  18. [18]

    Lack of Religious and Ritual Awareness: Insufficient exposure to culturally embedded religious practices prevents accurate inference of ritualized ethical behavior.

  19. [19]

    Social Bias Propagation: Prevalent societal hierarchies (gender, age, family roles) in training corpora bias model outputs, undermining justice-oriented reasoning.

  20. [20]

    You are a ... ... expert

    Limited Moral Abstraction Across Domains: Models struggle to generalize principles of virtue, justice, or altruism to contexts structurally different from those seen in training. E.3. Potential Solutions: To address these limitations, we propose: • Culturally Grounded Pretraining: Incorporate Bengali and South Asian ethical texts, folklore, and religious mat...