BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

Azmine Toushik Wasi; Dong-Kyu Chae; Koushik Ahamed Tonmoy; Shahriyar Zaman Ridoy; Taki Hasan Rafi

arxiv: 2511.03180 · v2 · submitted 2025-11-05 · 💻 cs.CL


Pith reviewed 2026-05-18 01:51 UTC · model grok-4.3

classification 💻 cs.CL

keywords moral reasoning · ethics benchmark · Bengali language · large language models · cultural alignment · multilingual AI · zero-shot evaluation · moral fairness


The pith

Large language models display inconsistent moral reasoning when evaluated on Bengali cultural contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BengaliMoralBench, a benchmark designed to assess moral reasoning in LLMs specifically for Bengali language and sociocultural settings. It covers five domains including daily activities, parenting, and religious activities, with scenarios annotated by native speakers using virtue ethics, commonsense ethics, and justice ethics. Evaluations of various models reveal substantial differences in performance and highlight weaknesses in handling cultural nuances and fairness. This matters because LLMs are increasingly used in regions where alignment with local ethics is essential to prevent misaligned outputs.

Core claim

We introduce BengaliMoralBench spanning five moral domains each with ten subtopics for a total of fifty culturally grounded categories. Scenarios receive annotations from native-speaker consensus under three ethical lenses, and zero-shot evaluations across open-weight and closed-source models including recent variants show substantial variation in performance across lenses and domains along with persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.

What carries the argument

BengaliMoralBench, the large-scale ethics benchmark with native-speaker consensus annotations under virtue, commonsense, and justice ethics for evaluating LLMs in Bengali contexts.

If this is right

Current LLMs may generate responses that conflict with Bengali moral norms in areas like family and religion.
The benchmark can be used to guide improvements in model training for better cultural sensitivity.
Deployment of LLMs in Bangladesh and similar markets requires careful ethical auditing to mitigate risks.
Multilingual models need targeted enhancements beyond general language capabilities to address moral fairness issues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this approach to other low-resource languages could create a network of culturally specific ethics benchmarks.
Improved performance on BengaliMoralBench might correlate with better real-world utility in Bengali-language applications.
Future work could involve dynamic scenarios that evolve with cultural changes rather than fixed annotations.

Load-bearing premise

Native-speaker consensus annotations under the three ethical lenses accurately and representatively capture Bengali moral norms across the chosen 50 subtopics without significant cultural bias or coverage gaps.

What would settle it

A replication study showing that independent groups of native Bengali speakers produce significantly different annotations from the original consensus, or all models achieving uniformly high scores without variation across lenses.
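One way to quantify the first condition is chance-corrected agreement between the original consensus labels and an independent replication. The sketch below computes Cohen's kappa for two label sequences. The function name `cohen_kappa` and the toy labels are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both label sets agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each set's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sets use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = ethical, 0 = unethical.
original = [1, 1, 0, 0, 1, 0, 1, 0]
replication = [1, 1, 0, 1, 1, 0, 0, 0]
kappa = cohen_kappa(original, replication)
print(f"kappa = {kappa:.2f}")
```

A low kappa between independent annotator groups would point toward label instability rather than model deficiency. What threshold counts as "significantly different" is a judgment the replication itself would have to justify.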

Figures

Figures reproduced from arXiv: 2511.03180 by Azmine Toushik Wasi, Dong-Kyu Chae, Koushik Ahamed Tonmoy, Shahriyar Zaman Ridoy, Taki Hasan Rafi.

**Figure 1.** Overview of the BengaliMoralBench Benchmark. (a) Illustrates examples from the Virtue, Justice, and Commonsense ethical frameworks, each presenting paired Bengali-English ethical and unethical behavioral scenarios grounded in cultural context. (b) Shows the domain-wise subtopic distribution structured across five major life domains: Family Relationships, Habits, Parenting, Religious Activities, and Daily A…

**Figure 2.** Overview of the BengaliMoralBench pipeline. (a) Benchmark: Native annotators wrote culturally grounded moral scenarios, refined through a pilot phase and multi-stage validation. (b) Evaluation: LLMs classify behaviors as Ethical or Unethical based on the chosen ethics type.

**Figure 3.** Average LLM performance (Accuracy and F1) across tasks.

**Figure 5.** Relation between model parameters and evaluation metrics.

… performance across LLM families, sizes, and tasks. Gemma 2 (9B) leads Commonsense and Virtue (Acc: 91.20%/89.70%; MCC: 0.8242/0.7947; κ: 0.8240/0.7940), while Qwen 2.5 (14B) is best on Justice (Acc: 86.29%; F1: 86.15; MCC: 0.7391; κ: 0.7255), indicating that data mix and instruction strategy matter more than parameter count. Llama 3.3 (70B) is competit…

**Figure 6.** Relation between model parameters and evaluation metrics across different tasks.

**Figure 7.** Error taxonomy, root causes, and remedies on BengaliMoralBench. (a) Errors: Recurrent failures across Commonsense, Justice, and Virtue scenarios (e.g., daily activities, family relationships, habits, parenting decisions). (b) Root causes: Cultural–context gaps, surface-level lexical reliance, limited religious/ritual awareness, and propagation of social biases. (c) Solutions: Culturally grounded pretrainin…

**Figure 8.** Studying effect of context and personas.

**Figure 9.** Error analysis.

**Figure 10.** Zero-shot prompts used for evaluation in both Bangla and English. The models are asked to respond with "1" (ethical) or "0" (unethical).
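The "1"/"0" response protocol described for Figure 10 can be scored roughly as follows. This is a minimal sketch, not the authors' evaluation code: `parse_label`, `binary_metrics`, the rule that unparseable replies count as errors, and the choice of positive-class F1 (with "ethical" as the positive class) are all assumptions, and the sample replies are invented for illustration.

```python
import math
import re

# Match a lone ASCII "0" or "1" that is not part of a longer number.
_LABEL = re.compile(r"(?<!\d)[01](?!\d)")


def parse_label(reply):
    """Return 1 or 0 when the reply names exactly one label, else None."""
    found = set(_LABEL.findall(reply))
    return int(found.pop()) if len(found) == 1 else None


def binary_metrics(gold, replies):
    """Accuracy, positive-class F1, and MCC for gold labels vs. raw replies."""
    tp = tn = fp = fn = 0
    for g, reply in zip(gold, replies):
        p = parse_label(reply)
        if p is None:
            p = 1 - g  # assumption: an unusable reply is scored as a wrong answer
        if g == 1 and p == 1:
            tp += 1
        elif g == 0 and p == 0:
            tn += 1
        elif g == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    n = tp + tn + fp + fn
    f1_denom = 2 * tp + fp + fn
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / n,
        "f1": 2 * tp / f1_denom if f1_denom else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


# Hypothetical gold labels and raw model replies.
gold = [1, 0, 1, 0, 1, 0]
replies = ["1", "0", "0", "Answer: 0", "1.", "unsure"]
metrics = binary_metrics(gold, replies)
print(metrics)
```

Counting unparseable replies as errors penalizes models that ignore the output format. Reporting them as a separate rate is an equally defensible choice.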

Original abstract

As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, spoken by over 285 million people worldwide and among the most widely spoken languages globally, remains underexplored. Existing ethics benchmarks are predominantly English-centric and shaped by Western moral frameworks, overlooking cultural nuances vital for real-world deployment. To address this gap, we introduce BengaliMoralBench, a large-scale ethics benchmark designed for Bengali language and sociocultural contexts. Our benchmark spans five moral domains: (1) Daily Activities, (2) Habits, (3) Parenting, (4) Family Relationships, and (5) Religious Activities, each subdivided into ten culturally grounded categories, totaling 50 subtopics. Each scenario is annotated through native-speaker consensus under three ethical lenses: virtue ethics, commonsense ethics, and justice ethics. We conduct a systematic zero-shot evaluation under a unified prompting protocol across both open-weight and closed-source models, including recent Llama and Gemma variants, Qwen and DeepSeek models, frontier models (GPT-4o-mini and Gemini 1.5 Pro), and a large multilingual baseline (Qwen3-Next-80B). Results show substantial variation in performance across lenses and domains, and our qualitative analysis reveals persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. These findings expose critical limitations of current LLMs in non-Western settings and underscore the need for culturally grounded evaluation. BengaliMoralBench provides a foundation for responsible localization and benchmarking to support the deployment of language technologies in culturally diverse, low-resource markets such as Bangladesh.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

BengaliMoralBench adds a needed culturally grounded moral dataset for Bengali, but the annotation process needs more detail on who contributed and how representative it is.


The main thing to know is that this paper creates BengaliMoralBench, a set of moral scenarios drawn from five domains like daily activities, habits, parenting, family ties, and religious practices, split into fifty subtopics total. Native speakers label them under virtue, commonsense, and justice lenses, then the authors test a range of models with the same zero-shot prompts and report clear differences plus ongoing problems with cultural fit and fairness judgments.

That fills a real gap since prior ethics benchmarks stay English-heavy and Western-framed, and nothing like this scale exists for Bengali in the cited work. The consistent protocol across open and closed models makes the comparisons easy to follow and the qualitative notes on weaknesses are straightforward to read. The evaluation itself looks standard and reproducible enough for a benchmark paper.

The soft spot is the ground truth construction. The stress-test point holds: without more on the annotators' regional, religious, or socioeconomic spread across Bangladesh and India, or on how consensus was reached and what agreement levels looked like, it is harder to treat the labels as broad Bengali norms rather than the views of a narrower group. That does not sink the work, but it does mean the claims about model deficiencies rest partly on unshown details. The paper stays clear of circular fitting and focuses on fresh data.

This is for people working on multilingual alignment or responsible deployment in South Asia. Readers who need new test sets for non-English ethics will get direct value from the resource and the baseline numbers. It has enough substance and honest engagement with the literature to go to a serious referee. I would send it for peer review after they add the missing annotation demographics and agreement stats.

Referee Report

2 major / 3 minor

Summary. The paper introduces BengaliMoralBench, a benchmark for auditing moral reasoning in LLMs within Bengali language and culture. It spans five domains (Daily Activities, Habits, Parenting, Family Relationships, Religious Activities) with ten culturally grounded subtopics each (50 total). Scenarios are annotated via native-speaker consensus under three ethical lenses (virtue ethics, commonsense ethics, justice ethics). The authors conduct zero-shot evaluations across open-weight models (Llama, Gemma, Qwen, DeepSeek variants), closed-source models (GPT-4o-mini, Gemini 1.5 Pro), and a multilingual baseline using a unified prompting protocol, reporting substantial performance variation across lenses and domains plus qualitative evidence of persistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.

Significance. If the annotations reliably capture Bengali sociocultural norms, the work would be significant for filling the gap in non-Western, non-English ethics benchmarks for LLMs. The independent construction of the dataset, systematic zero-shot evaluation across diverse models, and unified prompting protocol provide a reproducible foundation for culturally grounded assessment in low-resource South Asian contexts. This could support responsible localization of language technologies.

major comments (2)

[§3] §3 (Benchmark Construction and Annotation): The native-speaker consensus annotation process is described at a high level but provides no inter-annotator agreement statistics, number of annotators or scenarios, demographic details on the annotator pool (e.g., regional, religious, or socioeconomic diversity across the 285M+ Bengali-speaking population in Bangladesh and India), or exclusion criteria. This is load-bearing for the central claims, as the interpretation of LLM performance deviations as evidence of weaknesses in cultural grounding, commonsense reasoning, and moral fairness assumes the labels accurately and representatively reflect broad Bengali moral norms under the three lenses.
[§5] §5 (Results and Qualitative Analysis): The reported 'substantial variation' and 'persistent weaknesses' rest on model outputs compared to the consensus labels, yet the section does not include quantitative breakdowns (e.g., per-lens accuracy tables tied to specific subtopics) or error analysis that would allow readers to assess whether deviations stem from model deficiencies versus potential annotation gaps or biases.

minor comments (3)

[Abstract] Abstract: The description of the benchmark as 'large-scale' would be improved by stating the exact total number of scenarios or annotations to better contextualize scope and enable quick comparison with prior ethics benchmarks.
[Related Work] Related Work: The discussion of English-centric benchmarks could benefit from additional citations to recent multilingual or South Asian ethics evaluation efforts for fuller positioning.
[Tables and Figures] Tables/Figures: Ensure performance comparison tables include clear column labels for each ethical lens and model variant to improve readability of the variation results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript on BengaliMoralBench. We address each major comment below and outline the revisions we will make to improve clarity and rigor.


Referee: [§3] §3 (Benchmark Construction and Annotation): The native-speaker consensus annotation process is described at a high level but provides no inter-annotator agreement statistics, number of annotators or scenarios, demographic details on the annotator pool (e.g., regional, religious, or socioeconomic diversity across the 285M+ Bengali-speaking population in Bangladesh and India), or exclusion criteria. This is load-bearing for the central claims, as the interpretation of LLM performance deviations as evidence of weaknesses in cultural grounding, commonsense reasoning, and moral fairness assumes the labels accurately and representatively reflect broad Bengali moral norms under the three lenses.

Authors: We agree that the current description of the annotation process in §3 is insufficiently detailed to fully support our interpretive claims. In the revised manuscript we will expand this section to report inter-annotator agreement statistics, the precise number of annotators and scenarios, demographic characteristics of the annotator pool (including regional, religious, and socioeconomic diversity), and the exclusion criteria employed. These additions will increase transparency and strengthen the evidential basis for treating the consensus labels as representative of Bengali moral norms.
Revision: yes
Referee: [§5] §5 (Results and Qualitative Analysis): The reported 'substantial variation' and 'persistent weaknesses' rest on model outputs compared to the consensus labels, yet the section does not include quantitative breakdowns (e.g., per-lens accuracy tables tied to specific subtopics) or error analysis that would allow readers to assess whether deviations stem from model deficiencies versus potential annotation gaps or biases.

Authors: We concur that more granular quantitative reporting and error analysis are needed to allow readers to evaluate the sources of observed deviations. In the revised §5 we will add per-lens accuracy tables disaggregated by subtopic, together with a structured error analysis that categorizes model outputs and discusses the relative contributions of model limitations versus possible annotation variability or bias. This will make the evidence for our conclusions more transparent and interpretable.
Revision: yes
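A per-lens, per-subtopic accuracy breakdown of the kind promised in this response could be tabulated along these lines. The record fields (`lens`, `subtopic`, `gold`, `pred`) and the sample rows are hypothetical, chosen only to illustrate the grouping. They are not results from the paper.

```python
from collections import defaultdict


def accuracy_breakdown(records):
    """Map each (lens, subtopic) pair to the accuracy over its records."""
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for r in records:
        key = (r["lens"], r["subtopic"])
        counts[key][0] += int(r["gold"] == r["pred"])
        counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Hypothetical evaluation records: 1 = ethical, 0 = unethical.
records = [
    {"lens": "justice", "subtopic": "Sibling weddings", "gold": 0, "pred": 1},
    {"lens": "justice", "subtopic": "Sibling weddings", "gold": 0, "pred": 0},
    {"lens": "virtue", "subtopic": "Shoes indoors", "gold": 1, "pred": 1},
    {"lens": "virtue", "subtopic": "Shoes indoors", "gold": 1, "pred": 0},
    {"lens": "virtue", "subtopic": "Shoes indoors", "gold": 1, "pred": 1},
]
table = accuracy_breakdown(records)
for (lens, subtopic), acc in sorted(table.items()):
    print(f"{lens:8s} {subtopic:18s} {acc:.2f}")
```

Cells with few scenarios will have noisy accuracies, so a real table would likely report the per-cell count alongside each score.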

Circularity Check

0 steps flagged

No circularity: independent benchmark with direct evaluation


The paper constructs BengaliMoralBench as a new dataset across five domains and 50 subtopics, with native-speaker consensus annotations under virtue, commonsense, and justice lenses. It then performs zero-shot LLM evaluations using a unified prompting protocol. No equations, fitted parameters, self-citations, or ansatzes are invoked in a load-bearing manner. Results on model weaknesses derive directly from comparison against the created annotations rather than reducing to prior inputs or definitions by construction. This is a standard benchmark paper with self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that native-speaker consensus yields reliable moral labels for Bengali contexts; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Native-speaker consensus under virtue, commonsense, and justice ethics provides accurate ground truth for Bengali moral reasoning.
The benchmark construction and evaluation claims depend on this annotation process described in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: BengaliMoralBench... 3,000 human-annotated scenarios spanning... triadic ethical framework of Justice, Virtue, and Commonsense... zero-shot evaluation... accuracy ranging broadly from 60% to over 90%

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: native-speaker consensus annotations under the three ethical lenses

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

do anything now

URL https://arxiv.org/abs/2505.21092. Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell. Randomness, not representation: The unreliability of evaluating cultural alignment in LLMs. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, page 2151–2165, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9...

work page doi:10.1145/3715275.3732147 2025
[2]

evaluates LLM alignment with fundamental ethical values in various interaction scenarios, while ForbiddenQuestions (Shen et al., 2024) tests adherence to predefined ethical guidelines by assessing refusal to generate unsafe content. Additionally, the LLM Ethics Benchmark (Jiao et al., 2025) provides a systematic framework for evaluating moral reasoning, q...

work page 2024
[3]

AI colonialism

offers a large-scale morality benchmark referencing normative theories like Justice, Utilitarianism, Deontology, Virtue Ethics, and Commonsense Morality, containing over 130,000 examples. Table 5: Comparative Analysis of Existing Ethical AI Benchmarks. Benchmark Name | Primary Focus | Key Design Principles | Evaluation Methodologies | Noted Biases/Limitations | Truthfu...

work page 2022
[4]

Rickshaw/CNG commute, 3

1. Bazar run, 2. Rickshaw/CNG commute, 3. Sharing tiffin, 4. Queueing at offices, 5. Neighbour favours, 6. Tea stall chats, 7. Load-shedding etiquette, 8. Wedding invites, 9. Cyclone prep, 10. Digital payments. Habits: 1. Right/left-hand use, 2. Shoes indoors, 3. Saving water/electricity, 4. Honorifics, 5. Greeting elders, 6. Spitting/littering, 7. Rainwater ...

work page
[5]

Supporting elders, 3

1. Joint vs nuclear living, 2. Supporting elders, 3. Dowry, 4. Cousin guardianship, 5. Inheritance division, 6. Sibling weddings, 7. Land disputes, 8. Disabled care, 9. Interfaith love, 10. Festival gifts. Religious Activities

work page
[6]

Iftar with non-Muslims, 3

1. Workplace salat, 2. Iftar with non-Muslims, 3. Friday closures, 4. Zakat vs charity, 5. Qurbani distribution, 6. Puja respect, 7. Hijab in labs, 8. Ramadan music, 9. Halal loans,

work page
[7]

good”) orkharap(“bad

Aqeeqah animal choice. indicators such as valo (“good”) or kharap (“bad”). Following the pilot phase and subsequent guideline revisions, inter-annotator agreement significantly improved, rising from κ = 0.61 to κ = 0.87. B.3. Evaluation Metrics. To comprehensively assess model performance on BengaliMoralBench, we report four standard classification metrics across the ...

work page
[8]

The model misinterprets logistical action as morally neutral

Commonsense Failures in Daily Activities: In scenarios such as Rickshaw Commute, models fail to recognize implicit virtuous choices, here, Sayem walking to school for his parents’ Ramadan shopping. The model misinterprets logistical action as morally neutral. This demonstrates a deficiency in context-aware commonsense reasoning and an inability to infer virt...

work page
[9]

Justice Ethics Violations in Family Relationships: For statements involving structural inequality, such as Siblings’ Weddings, models incorrectly label clearly unethical gender-biased prioritization as acceptable. The model’s prediction aligns with prevalent societal norms rather than critiquing them, revealing sensitivity to statistical regularities in trai...

work page
[10]

Virtue Ethics Misclassification in Habits: Acts of cultural etiquette, e.g., removing shoes before entering a relative’s house, are frequently dismissed as non-virtuous. The model undervalues subtle moral actions encoded in South Asian social norms, signaling a Western-centric bias in virtue encoding that prioritizes explicit moral acts over culturally nuan...

work page
[11]

The preference for Western moral schemas leads to misclassification of culturally salient virtues, demonstrating limited alignment with non-Western parenting philosophies

Cultural Misalignment in Parenting Decisions: In scenarios like School Choice, models fail to recognize moral pluralism in parental respect for children’s educational autonomy. The preference for Western moral schemas leads to misclassification of culturally salient virtues, demonstrating limited alignment with non-Western parenting philosophies

work page
[12]

These errors indicate poor contextual grounding in domain-specific religious ethics, with models treating deeply symbolic behaviors as mundane events

Shallow Reasoning in Religious Contexts: For acts of ritualized altruism, such as distributing Qurbani meat to the poor, models fail to capture faith-driven moral significance. These errors indicate poor contextual grounding in domain-specific religious ethics, with models treating deeply symbolic behaviors as mundane events

work page
[13]

walked,” “school

Surface-Level Pattern Reliance: Across multiple domains, models often rely on lexical or surface cues (e.g., “walked,” “school”) without considering context or intention. This leads to both false negatives in virtue recognition and false positives in ethical violations, reflecting an overreliance on statistical co-occurrence rather than semantic comprehension

work page
[14]

Mislabeling unethical prioritization of male family members illustrates systemic bias inherited from training corpora, affecting the fairness and moral alignment of predictions

Gender and Social Hierarchy Biases: Models reproduce embedded societal hierarchies in family and social scenarios. Mislabeling unethical prioritization of male family members illustrates systemic bias inherited from training corpora, affecting the fairness and moral alignment of predictions

work page
[15]

26 BengaliMoralBench E.2

Limited Cross-Domain Generalization: Even when models correctly recognize virtue in one domain (e.g., generosity), they fail in structurally similar contexts (e.g., respect for elders), indicating insufficient abstraction of moral principles across cultural and situational boundaries. E.2. Root Causes of Errors. From the qualitative revi...

work page
[16]

Cultural Context Gap: Predominantly Western training data limits understanding of South Asian-specific moral codes, leading to misinterpretation of local virtues

work page
[17]

Surface-Level Lexical Reliance: Models depend heavily on keywords rather than reasoning over intentions or outcomes, producing brittle ethical judgments

work page
[18]

Lack of Religious and Ritual Awareness: Insufficient exposure to culturally embedded religious practices prevents accurate inference of ritualized ethical behavior

work page
[19]

Social Bias Propagation: Prevalent societal hierarchies (gender, age, family roles) in training corpora bias model outputs, undermining justice-oriented reasoning

work page
[20]

You are a ... ... expert

Limited Moral Abstraction Across Domains: Models struggle to generalize principles of virtue, justice, or altruism to contexts structurally different from those seen in training. E.3. Potential Solutions. To address these limitations, we propose: • Culturally Grounded Pretraining: Incorporate Bengali and South Asian ethical texts, folklore, and religious mat...

work page 2024
