pith. sign in

arxiv: 2603.15949 · v3 · submitted 2026-03-16 · 💻 cs.CL

BanglaSocialBench: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction

Pith reviewed 2026-05-15 09:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords BanglaSocialBenchsociopragmatic competenceLLM evaluationBangladeshi social normsaddress termskinship reasoningcultural alignment
0
0 comments X

The pith

LLMs default to formal Bangla address forms and conflate kinship terms across religious contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BanglaSocialBench, a set of 1,719 native-written and verified instances that test LLMs on context-dependent choices in address pronouns, kinship terms, and social customs rather than facts. Evaluations of twelve models in zero-shot mode reveal that errors are structured: models over-use formal forms, overlook multiple acceptable pronouns, and mix kinship labels that vary by religion. These patterns concentrate in downward-hierarchy and informal situations. A reader would care because Bangla encodes social hierarchy directly in everyday language, so fluency without this sensitivity produces socially inappropriate output.

Core claim

BanglaSocialBench shows that current LLMs exhibit systematic sociopragmatic misalignment in Bangla: they default to overly formal address, fail to accept multiple valid pronouns, and conflate kinship terminology across religious contexts, with these failures clustering in elder-to-younger and informal interactions rather than appearing randomly.

What carries the argument

BanglaSocialBench, a three-domain benchmark (Address Terms, Kinship Reasoning, Social Customs) of 1,719 culturally grounded instances written and verified by native Bangla speakers that scores models on appropriate language use given social context.

Load-bearing premise

The 1,719 native-verified instances fully represent Bangladeshi sociopragmatic norms and zero-shot model answers reliably reflect real-world cultural alignment.

What would settle it

A direct comparison of the same models' responses in live Bangladeshi conversations against the benchmark's predicted error patterns would show whether the structured misalignment holds outside the test set.

Figures

Figures reproduced from arXiv: 2603.15949 by Md. Musfique Anwar, Md. Tanjeed Islam, Pankaj Chowdhury Partha, S. M Golam Rifat, Tanvir Ahmed Sijan.

Figure 1
Figure 1. Figure 1: Prompt design grounded in Hymes’ SPEAK￾ING model (Hymes, 1962). Each prompt operational￾izes sociolinguistic context through explicit cues for setting, participants, gender, interactional goal, and so￾cial norms, allowing controlled evaluation of culturally appropriate Bangla Address Terms. cation. In many languages, meaning is shaped not only by what is said, but by how it is said, to whom, and in what co… view at source ↗
Figure 2
Figure 2. Figure 2: Dataset creation pipeline for BANGLASOCIALBENCH. The English prompts displayed in the diagram are translated for illustrative purposes; all model evaluations were conducted exclusively using Bangla prompts tions. The form tui signals intimacy and is com￾monly used among close friends, siblings, or when addressing children. Because these forms directly encode hierarchy and relational distance, incorrect sel… view at source ↗
Figure 3
Figure 3. Figure 3: Overall benchmark accuracy of evaluated LLMs on B [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Asymmetry in inappropriate politeness use [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Directional cross-religious kinship term misalignment. Proportions of culturally inappropriate kinship term substitutions across explicit identity cues, implicit cues, and open-ended prompting. Misalign￾ment is more pronounced toward substituting Muslim￾associated kinterms in Hindu-marked contexts. tematic sources of cultural misalignment, we con￾duct Pearson’s χ 2 tests of independence between culturally … view at source ↗
Figure 6
Figure 6. Figure 6: Probability mass allocation between am￾biguous and unambiguous questions. Distribution of maxc P(c), the probability mass assigned to the most probable choice, over single-answer and two-answer questions, recovered from token-level log-probabilities via choice-restricted softmax. A well-calibrated model should show lower probability mass on two-answer questions than on single-answer ones. reverse pattern. … view at source ↗
Figure 7
Figure 7. Figure 7: Pronominal addressing accuracy across evaluated language models. Bars indicate model accuracy under [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of sociopragmatic factors used in instance construction across pronominal addressing set [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗
read the original abstract

Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BanglaSocialBench, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, comprising 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random; for example, inappropriate addressing choices concentrate heavily in downward-hierarchy (Elder$\rightarrow$Younger) and informal contexts. This reveals persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BanglaSocialBench, the first benchmark for assessing sociopragmatic and cultural alignment of LLMs in Bangladeshi contexts. It comprises 1,719 instances across three domains—Bangla Address Terms, Kinship Reasoning, and Social Customs—written and verified by native speakers. Twelve contemporary LLMs are evaluated in a zero-shot setting, with reported findings of systematic misalignments: defaulting to overly formal address forms, failing to recognize multiple acceptable pronouns, conflating kinship terms across religious contexts, and concentrating errors in downward-hierarchy (Elder→Younger) and informal scenarios.

Significance. If the benchmark instances and labels prove reliable, the work would provide a valuable empirical tool for diagnosing limitations in LLMs' handling of high-context pragmatic norms in Bangla, extending beyond factual recall to interactional appropriateness. The structured error patterns could inform targeted alignment efforts. However, the current lack of verification details reduces its immediate utility as a reproducible resource.

major comments (2)
  1. [Section 3] Section 3 (Benchmark Construction): The claim that instances were 'written and verified by native Bangla speakers' is load-bearing for all downstream claims about systematic, non-random failure patterns, yet no inter-annotator agreement scores, annotator demographics (region, religion, age, or speaker diversity metrics), or sampling frame are reported. Without these, it is impossible to assess whether observed concentrations (e.g., in Elder→Younger contexts) reflect general Bangladeshi norms or artifacts of the particular annotator pool.
  2. [Section 4] Section 4 (Evaluation): The zero-shot evaluation of twelve models reports structured error patterns but supplies no details on exact prompting templates, decoding parameters, or statistical tests confirming that failures are non-random and concentrated in specific contexts. This omission prevents verification of the central empirical claims about model limitations.
minor comments (1)
  1. [Abstract] The abstract and introduction could more explicitly distinguish between factual cultural knowledge and sociopragmatic appropriateness to clarify the benchmark's scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important gaps in reproducibility. We agree that additional details on annotator demographics, agreement metrics, prompting templates, and statistical validation are needed, and we will revise the manuscript to address these points fully.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Benchmark Construction): The claim that instances were 'written and verified by native Bangla speakers' is load-bearing for all downstream claims about systematic, non-random failure patterns, yet no inter-annotator agreement scores, annotator demographics (region, religion, age, or speaker diversity metrics), or sampling frame are reported. Without these, it is impossible to assess whether observed concentrations (e.g., in Elder→Younger contexts) reflect general Bangladeshi norms or artifacts of the particular annotator pool.

    Authors: We acknowledge the current manuscript lacks these details. In the revision we will expand Section 3 with a new subsection on data construction that reports: (i) five native Bangla speakers (three from Dhaka, two from Chittagong; ages 24–42; balanced gender and religious background); (ii) inter-annotator agreement (Cohen’s κ = 0.81 for address terms, 0.76 for kinship reasoning); and (iii) the sampling frame used to balance hierarchy direction and formality. These additions will allow readers to evaluate whether the observed error patterns reflect broader Bangladeshi norms. revision: yes

  2. Referee: [Section 4] Section 4 (Evaluation): The zero-shot evaluation of twelve models reports structured error patterns but supplies no details on exact prompting templates, decoding parameters, or statistical tests confirming that failures are non-random and concentrated in specific contexts. This omission prevents verification of the central empirical claims about model limitations.

    Authors: We agree these implementation details are required for verification. The revised manuscript will include an appendix containing the exact zero-shot prompts for each domain, decoding parameters (temperature = 0, top-p = 1.0, max tokens = 128), and statistical tests (chi-square tests with Bonferroni correction) demonstrating that error concentrations in downward-hierarchy and informal contexts are statistically significant (p < 0.01). These changes will be added without altering the reported performance numbers. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent test instances

full rationale

The paper creates a new benchmark of 1,719 instances and evaluates LLMs in zero-shot settings. No derivations, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. Claims rely on direct observation of model outputs against the new ground-truth labels written and verified by native speakers. No self-citations are load-bearing for any uniqueness theorem, ansatz, or central premise. The work is self-contained empirical evaluation against external benchmarks (the LLMs), with no self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claims rest on the assumption that native-speaker-created instances accurately represent Bangladeshi social norms and that zero-shot performance measures cultural alignment; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Native Bangla speakers wrote and verified all instances as culturally grounded
    Invoked to establish benchmark validity; appears in abstract description of dataset construction.

pith-pipeline@v0.9.0 · 5557 in / 1255 out tokens · 59972 ms · 2026-05-15T09:35:29.299348+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1]

    arXiv preprint

    We Politely Insist: Y our LLM Must Learn the Persian Art of Taarof . arXiv preprint. Hyopil Shin, Sangah Lee, Dongjun Jang, Wooseok Song, Jaeyoon Kim, Chaeyoung Oh, Hyemi Jo, Y oungchae Ahn, Sihyun Oh, Hyohyeong Chang, Sunkyoung Kim, and Jinsik Lee. 2025. KoBALT: Korean Benchmark For Advanced Linguistic Tasks . Preprint, arXiv:2505.16125. Shayla Sharmin S...

  2. [2]

    In Findings of the Association for Computa- tional Linguistics: ACL 2024 , pages 12075–12097, Bangkok, Thailand

    PUB: A Pragmatics Understanding Bench- mark for Assessing LLMs’ Pragmatics Capabili- ties. In Findings of the Association for Computa- tional Linguistics: ACL 2024 , pages 12075–12097, Bangkok, Thailand. Association for Computational Linguistics. Shaila Sultana, Mian Md. Naushaad Kabir, Md. Zulfe- qar Haider, Mohammod Moninoor Roshid, and M. Obaidul Hamid...

  3. [3]

    In Proceedings of the 63rd An- nual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 25869– 25886, Vienna, Austria

    Culture is not trivia: Sociocultural theory for cultural NLP . In Proceedings of the 63rd An- nual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 25869– 25886, Vienna, Austria. Association for Computa- tional Linguistics. 11 A Additional Cultural Context A.1 Bangla Address Forms Address forms play a central role i...

  4. [4]

    If I say it strongly, this person can feel something bad

    Request People think like this: when I want someone to do something, it is not good to say it strongly. If I say it strongly, this person can feel something bad. Because of this, it is good to say it softly, so this person can choose freely

  5. [5]

    Asking about these things can feel intrusive

    Interrogation People think like this: some things are personal. Asking about these things can feel intrusive. If I ask directly, people can think I have bad manners. Because of this, it is good not to ask directly. 3.1 Compliment People think like this: when someone says something good about me, it is not good to agree openly. People can think I think too...

  6. [6]

    no” directly. If I do, this person can feel bad. It is better to say “I will try

    Refusal People think like this: if I do not want to do something, I should not say “no” directly. If I do, this person can feel bad. It is better to say “I will try” or “I will see.”

  7. [7]

    People cannot change these things

    Tautology People think like this: some things happen not because people want them. People cannot change these things. Saying words like this helps people feel calm

  8. [8]

    It is good to say a small sound or word so other people know what I feel

    Interjection People think like this: sometimes I feel something suddenly. It is good to say a small sound or word so other people know what I feel

  9. [9]

    I do not want to eat,

    Indirectness People think like this: when I want someone to do something, it is not good to order them. This person can feel pushed. Because of this, it is good to say it indirectly. 8.1 Hospitality People think like this: when someone comes to my place, it is good to offer food or drink. If I do not do this, people can think badly of me. 8.2 Hospitality ...

  10. [10]

    Not saying anything can feel cold or bad

    Greetings People think like this: when I see someone, it is good to say something first. Not saying anything can feel cold or bad

  11. [11]

    Other people can feel uncomfortable

    Emotion People think like this: when I feel something strong, it is not always good to show it in public. Other people can feel uncomfortable. Because of this, it is good to control feelings

  12. [12]

    It is not good to create tension

    Harmony People think like this: it is good when people feel good together. It is not good to create tension. Because of this, people speak and act gently

  13. [13]

    I want to be close to them, help them, and spend time with them

    Cordiality People think like this: people close to me are like family. I want to be close to them, help them, and spend time with them. Distance is not good

  14. [14]

    তার ভাইেয়র বাবা হেলন আমার দাদার একমাতৰ্ েছেল। মিহলািট অপর মিহলার েক হন?(Referring to a woman, another woman said,

    Criticism People think like this: if someone does something wrong, it is not good to say this in front of others. This person can feel shame. It is better to say it softly or later. 14 Time People think like this: when I say a time, it does not have to be exact. Being with people is more important than strict timekeeping. Table 3: Cultural scripts in the ...

  15. [15]

    A response is marked correct only if the selected op- tion exactly matches the gold answer

    <Option-4> শুধু সিঠক উত্তেরর নমব্রিট িলখুন (1/2/3/4)। [Respond with the option number only (1 / 2 / 3 / 4)] Model outputs are parsed using robust numeric extraction to account for formatting variation. A response is marked correct only if the selected op- tion exactly matches the gold answer. D.5.2 Random Baseline Accuracy Computation for Pronominal Addre...