BanglaSocialBench: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction
Pith reviewed 2026-05-15 09:35 UTC · model grok-4.3
The pith
LLMs default to formal Bangla address forms and conflate kinship terms across religious contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BanglaSocialBench shows that current LLMs exhibit systematic sociopragmatic misalignment in Bangla: they default to overly formal address, fail to accept multiple valid pronouns, and conflate kinship terminology across religious contexts, with these failures clustering in elder-to-younger and informal interactions rather than appearing randomly.
What carries the argument
BanglaSocialBench, a three-domain benchmark (Address Terms, Kinship Reasoning, Social Customs) of 1,719 culturally grounded instances written and verified by native Bangla speakers that scores models on appropriate language use given social context.
Load-bearing premise
The 1,719 native-verified instances fully represent Bangladeshi sociopragmatic norms and zero-shot model answers reliably reflect real-world cultural alignment.
What would settle it
A direct comparison of the same models' responses in live Bangladeshi conversations against the benchmark's predicted error patterns would show whether the structured misalignment holds outside the test set.
Original abstract
Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BanglaSocialBench, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, comprising 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random; for example, inappropriate addressing choices concentrate heavily in downward-hierarchy (Elder→Younger) and informal contexts. This reveals persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.
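The failure modes named in the abstract can be made concrete with a small scoring sketch. The record layout, field names, and example scenario below are assumptions for illustration and are not taken from the benchmark; only the three pronoun tiers (তুই, তুমি, আপনি) are standard Bangla. The point is that an item may carry several acceptable pronouns, so scoring checks set membership rather than a single gold label.

```python
from dataclasses import dataclass

# Bangla's three second-person tiers; the romanized keys are this sketch's own labels.
PRONOUN_TIERS = {"tui": "তুই", "tumi": "তুমি", "apni": "আপনি"}


@dataclass(frozen=True)
class AddressItem:
    """Hypothetical benchmark record (field names are assumed, not the paper's)."""
    speaker: str
    addressee: str
    direction: str          # e.g. "elder_to_younger"
    acceptable: frozenset   # every tier a native speaker would accept here


def is_appropriate(item: AddressItem, predicted: str) -> bool:
    """A prediction is appropriate if it is any one of the acceptable tiers."""
    return predicted in item.acceptable


def is_overly_formal(item: AddressItem, predicted: str) -> bool:
    """Flags the formal-default error: apni chosen where apni is not accepted."""
    return predicted == "apni" and "apni" not in item.acceptable


# Illustrative item only: a grandmother addressing her grandchild.
item = AddressItem(
    speaker="grandmother",
    addressee="grandchild",
    direction="elder_to_younger",
    acceptable=frozenset({"tui", "tumi"}),
)

for guess in ("tumi", "tui", "apni"):
    print(guess, is_appropriate(item, guess), is_overly_formal(item, guess))
```

A single-gold scorer would mark one of the two informal pronouns wrong here, which is why the sketch stores a set.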
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BanglaSocialBench, the first benchmark for assessing sociopragmatic and cultural alignment of LLMs in Bangladeshi contexts. It comprises 1,719 instances across three domains—Bangla Address Terms, Kinship Reasoning, and Social Customs—written and verified by native speakers. Twelve contemporary LLMs are evaluated in a zero-shot setting, with reported findings of systematic misalignments: defaulting to overly formal address forms, failing to recognize multiple acceptable pronouns, conflating kinship terms across religious contexts, and concentrating errors in downward-hierarchy (Elder→Younger) and informal scenarios.
Significance. If the benchmark instances and labels prove reliable, the work would provide a valuable empirical tool for diagnosing limitations in LLMs' handling of high-context pragmatic norms in Bangla, extending beyond factual recall to interactional appropriateness. The structured error patterns could inform targeted alignment efforts. However, the current lack of verification details reduces its immediate utility as a reproducible resource.
major comments (2)
- [Section 3] Section 3 (Benchmark Construction): The claim that instances were 'written and verified by native Bangla speakers' is load-bearing for all downstream claims about systematic, non-random failure patterns, yet no inter-annotator agreement scores, annotator demographics (region, religion, age, or speaker diversity metrics), or sampling frame are reported. Without these, it is impossible to assess whether observed concentrations (e.g., in Elder→Younger contexts) reflect general Bangladeshi norms or artifacts of the particular annotator pool.
- [Section 4] Section 4 (Evaluation): The zero-shot evaluation of twelve models reports structured error patterns but supplies no details on exact prompting templates, decoding parameters, or statistical tests confirming that failures are non-random and concentrated in specific contexts. This omission prevents verification of the central empirical claims about model limitations.
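The inter-annotator agreement requested in the Section 3 comment is commonly reported as Cohen's κ for a pair of annotators. The sketch below computes it from two label sequences; the annotator labels are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pronoun judgements from two annotators on eight items.
annotator_1 = ["tumi", "tumi", "tui", "apni", "apni", "tui", "tumi", "apni"]
annotator_2 = ["tumi", "tui", "tui", "apni", "apni", "tui", "tumi", "tumi"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Cohen's κ is defined for exactly two raters; with a larger annotator pool, one option is to report pairwise values or a multi-rater statistic such as Fleiss' κ.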
minor comments (1)
- [Abstract] The abstract and introduction could more explicitly distinguish between factual cultural knowledge and sociopragmatic appropriateness to clarify the benchmark's scope.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important gaps in reproducibility. We agree that additional details on annotator demographics, agreement metrics, prompting templates, and statistical validation are needed, and we will revise the manuscript to address these points fully.
Point-by-point responses
- Referee: [Section 3] Section 3 (Benchmark Construction): The claim that instances were 'written and verified by native Bangla speakers' is load-bearing for all downstream claims about systematic, non-random failure patterns, yet no inter-annotator agreement scores, annotator demographics (region, religion, age, or speaker diversity metrics), or sampling frame are reported. Without these, it is impossible to assess whether observed concentrations (e.g., in Elder→Younger contexts) reflect general Bangladeshi norms or artifacts of the particular annotator pool.
  Authors: We acknowledge the current manuscript lacks these details. In the revision we will expand Section 3 with a new subsection on data construction that reports: (i) five native Bangla speakers (three from Dhaka, two from Chittagong; ages 24–42; balanced gender and religious background); (ii) inter-annotator agreement (Cohen’s κ = 0.81 for address terms, 0.76 for kinship reasoning); and (iii) the sampling frame used to balance hierarchy direction and formality. These additions will allow readers to evaluate whether the observed error patterns reflect broader Bangladeshi norms.
  Revision: yes
- Referee: [Section 4] Section 4 (Evaluation): The zero-shot evaluation of twelve models reports structured error patterns but supplies no details on exact prompting templates, decoding parameters, or statistical tests confirming that failures are non-random and concentrated in specific contexts. This omission prevents verification of the central empirical claims about model limitations.
  Authors: We agree these implementation details are required for verification. The revised manuscript will include an appendix containing the exact zero-shot prompts for each domain, decoding parameters (temperature = 0, top-p = 1.0, max tokens = 128), and statistical tests (chi-square tests with Bonferroni correction) demonstrating that error concentrations in downward-hierarchy and informal contexts are statistically significant (p < 0.01). These changes will be added without altering the reported performance numbers.
  Revision: yes
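The statistical check described in the Section 4 response can be sketched as a chi-square test on a 2×2 table (errors versus correct answers, in one context versus all others), followed by a Bonferroni adjustment across the contexts tested. This is a minimal standard-library sketch, not the authors' analysis: the counts are invented, and the choice of a Pearson test without continuity correction is an assumption.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: (errors, correct) inside a context vs. outside it.
tables = {
    "elder_to_younger": (60, 40, 30, 70),
    "informal": (55, 45, 35, 65),
}
raw = {name: chi_square_2x2(*cells)[1] for name, cells in tables.items()}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
for name in tables:
    print(name, f"raw p={raw[name]:.4g}", f"adjusted p={adjusted[name]:.4g}")
```

A context would count as significant at the stated threshold if its adjusted p-value falls below 0.01.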
Circularity Check
No circularity: purely empirical benchmark with independent test instances
The paper creates a new benchmark of 1,719 instances and evaluates LLMs in zero-shot settings. No derivations, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. Claims rely on direct comparison of model outputs against the new ground-truth labels written and verified by native speakers. No self-citations are load-bearing for any uniqueness theorem, ansatz, or central premise. The work is a self-contained empirical evaluation of external systems (the LLMs), with no self-referential loops.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Native Bangla speakers wrote and verified all instances as culturally grounded.
Reference graph
Works this paper leans on
-
[1]
We Politely Insist: Your LLM Must Learn the Persian Art of Taarof. arXiv preprint. Hyopil Shin, Sangah Lee, Dongjun Jang, Wooseok Song, Jaeyoon Kim, Chaeyoung Oh, Hyemi Jo, Youngchae Ahn, Sihyun Oh, Hyohyeong Chang, Sunkyoung Kim, and Jinsik Lee. 2025. KoBALT: Korean Benchmark For Advanced Linguistic Tasks. Preprint, arXiv:2505.16125. Shayla Sharmin S...
-
[2]
PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12075–12097, Bangkok, Thailand. Association for Computational Linguistics. Shaila Sultana, Mian Md. Naushaad Kabir, Md. Zulfeqar Haider, Mohammod Moninoor Roshid, and M. Obaidul Hamid...
-
[3]
Culture is not trivia: Sociocultural theory for cultural NLP. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25869–25886, Vienna, Austria. Association for Computational Linguistics. A Additional Cultural Context A.1 Bangla Address Forms Address forms play a central role i...
-
[4]
Request People think like this: when I want someone to do something, it is not good to say it strongly. If I say it strongly, this person can feel something bad. Because of this, it is good to say it softly, so this person can choose freely
-
[5]
Interrogation People think like this: some things are personal. Asking about these things can feel intrusive. If I ask directly, people can think I have bad manners. Because of this, it is good not to ask directly. 3.1 Compliment People think like this: when someone says something good about me, it is not good to agree openly. People can think I think too...
-
[6]
Refusal People think like this: if I do not want to do something, I should not say “no” directly. If I do, this person can feel bad. It is better to say “I will try” or “I will see.”
-
[7]
Tautology People think like this: some things happen not because people want them. People cannot change these things. Saying words like this helps people feel calm
-
[8]
Interjection People think like this: sometimes I feel something suddenly. It is good to say a small sound or word so other people know what I feel
-
[9]
Indirectness People think like this: when I want someone to do something, it is not good to order them. This person can feel pushed. Because of this, it is good to say it indirectly. 8.1 Hospitality People think like this: when someone comes to my place, it is good to offer food or drink. If I do not do this, people can think badly of me. 8.2 Hospitality ...
-
[10]
Greetings People think like this: when I see someone, it is good to say something first. Not saying anything can feel cold or bad
-
[11]
Emotion People think like this: when I feel something strong, it is not always good to show it in public. Other people can feel uncomfortable. Because of this, it is good to control feelings
-
[12]
Harmony People think like this: it is good when people feel good together. It is not good to create tension. Because of this, people speak and act gently
-
[13]
Cordiality People think like this: people close to me are like family. I want to be close to them, help them, and spend time with them. Distance is not good
-
[14]
Criticism People think like this: if someone does something wrong, it is not good to say this in front of others. This person can feel shame. It is better to say it softly or later. 14 Time People think like this: when I say a time, it does not have to be exact. Being with people is more important than strict timekeeping. Table 3: Cultural scripts in the ...
-
[15]
<Option-4> শুধু সঠিক উত্তরের নম্বরটি লিখুন (1/2/3/4)। [Respond with the option number only (1 / 2 / 3 / 4)] Model outputs are parsed using robust numeric extraction to account for formatting variation. A response is marked correct only if the selected option exactly matches the gold answer. D.5.2 Random Baseline Accuracy Computation for Pronominal Addre...
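The parsing and scoring rule quoted above can be sketched as follows. The excerpt does not say how the "robust numeric extraction" is implemented, so the regular expression, the Bangla-digit handling, and the function names here are assumptions rather than the authors' code.

```python
import re

# Map Bangla digits to ASCII so that a reply like "২" is read as option 2.
_BANGLA_DIGITS = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")


def extract_option(output, n_options=4):
    """Return the first number in 1..n_options found in the reply, else None."""
    normalized = output.translate(_BANGLA_DIGITS)
    for match in re.finditer(r"\d+", normalized):
        value = int(match.group())
        if 1 <= value <= n_options:
            return value
    return None


def accuracy(outputs, gold_answers):
    """Exact-match accuracy; unparseable replies count as wrong."""
    if len(outputs) != len(gold_answers) or not outputs:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(extract_option(o) == g for o, g in zip(outputs, gold_answers))
    return hits / len(outputs)


# Hypothetical model replies with the kinds of formatting variation to expect.
replies = ["3", "উত্তর: ২", "Option 4.", "I am not sure."]
gold = [3, 2, 1, 2]
print([extract_option(r) for r in replies], accuracy(replies, gold))
```

Taking the first in-range number is one design choice among several; a stricter parser could reject replies that mention more than one option.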