SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

Alham Fikri Aji; Attapol T. Rutherford; Can Udomcharoenchaikit; Jian Gang Ngui; Peerat Limkonchotiwat; Peerawat Chomphooyod; Sarana Nutanong; Yosephine Susanto

arxiv: 2606.03284 · v1 · pith:BGDAUDDQnew · submitted 2026-06-02 · 💻 cs.CL


Pith reviewed 2026-06-28 10:17 UTC · model grok-4.3


keywords: natural language inference · Southeast Asia · cultural knowledge · language model evaluation · benchmark · cultural understanding · LLM adaptation


The pith

All tested language models show low performance on Southeast Asian cultural NLI mainly due to missing cultural knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SEA-NLI, a native benchmark of natural language inference examples drawn from eight Southeast Asian countries and verified by native speakers in both English and local languages. Evaluation of seventeen encoder and decoder models finds consistently low accuracy, especially on knowledge-intensive topics such as Languages and Science and Technology. The authors conclude that most errors stem from gaps in SEA-specific cultural information rather than general reasoning shortfalls, with SEA-adapted models and culture-aware prompts producing gains while chain-of-thought prompting yields only limited benefit.

Core claim

SEA-NLI shows that frontier language models perform poorly on culturally grounded reasoning from Southeast Asia, with the performance gaps driven primarily by missing regional cultural knowledge that can be partially addressed through model adaptation to the region and culture-aware prompting strategies.

What carries the argument

The SEA-NLI benchmark, a set of premise-hypothesis pairs that test culturally specific Southeast Asian facts and norms across eight countries and multiple languages.

If this is right

SEA-adapted models achieve higher accuracy than general-purpose models on the benchmark.
Culture-aware prompting produces measurable improvements in model performance.
Chain-of-thought prompting offers only limited gains compared with culture-aware methods.
The largest drops occur in knowledge-intensive categories such as Languages and Science and Technology.
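The category-level comparison in the list above could be computed along the following lines. This is a minimal sketch, assuming the weighted F1 metric named in the paper's Figure 5 caption; the record fields (`category`, `gold`, `pred`), the "Food" category, and the toy rows are hypothetical and are not taken from the benchmark.

```python
from collections import Counter, defaultdict

LABELS = ("entailment", "neutral", "contradiction")

def weighted_f1(gold, pred, labels=LABELS):
    """Support-weighted mean of per-label F1 scores."""
    total = len(gold)
    if total == 0:
        return 0.0
    support = Counter(gold)
    score = 0.0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each label contributes in proportion to its share of gold items.
        score += f1 * support[label] / total
    return score

def f1_by_category(records):
    """Group records by category and score each group separately."""
    grouped = defaultdict(lambda: ([], []))
    for r in records:
        golds, preds = grouped[r["category"]]
        golds.append(r["gold"])
        preds.append(r["pred"])
    return {cat: weighted_f1(g, p) for cat, (g, p) in grouped.items()}

# Hypothetical toy predictions, for illustration only.
records = [
    {"category": "Food", "gold": "entailment", "pred": "entailment"},
    {"category": "Food", "gold": "neutral", "pred": "neutral"},
    {"category": "Languages", "gold": "entailment", "pred": "neutral"},
    {"category": "Languages", "gold": "contradiction", "pred": "contradiction"},
]
print(f1_by_category(records))
```

Scoring each category on its own keeps a large, easy category from masking a drop in a small, knowledge-intensive one.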

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar knowledge gaps are likely present for other underrepresented regions when current models are tested on native cultural reasoning tasks.
Benchmarks focused on specific cultural contexts can be used to measure progress toward more geographically balanced language models.
Targeted data collection for underrepresented cultures may be required to close the gaps identified here.

Load-bearing premise

The benchmark items are verifiably culturally grounded and representative of Southeast Asian reasoning, and the observed performance gaps are driven primarily by missing cultural knowledge rather than language modeling difficulty, annotation artifacts, or task formulation.

What would settle it

A model given extensive additional training on Southeast Asian cultural texts and languages that still scores as poorly as the current models on SEA-NLI would challenge the claim that missing cultural knowledge is the main source of errors.

Figures

Figures reproduced from arXiv: 2606.03284 by Alham Fikri Aji, Attapol T. Rutherford, Can Udomcharoenchaikit, Jian Gang Ngui, Peerat Limkonchotiwat, Peerawat Chomphooyod, Sarana Nutanong, Yosephine Susanto.

**Figure 2.** Comparison of SEA-NLI with the existing NLI datasets

**Figure 4.** Data statistics of SEA-NLI.

Adjacent paper text (4 Experimental Setup, Models): To answer RQ2 and RQ3, we employ 17 models across encoder-based and decoder-based models to be evaluated on the SEA-NLI benchmark. For the encoder-based evaluation, we fine-tune the pre-trained base models: XLM-R (Conneau et al., 2020), mmBERT (Marone et al., 2025), and mDeBERTa (He et al., 2023), using SNLI (Bowman et al., 2015) and XNLI (Conneau e…

**Figure 5.** Weighted F1 performance across cultural con…

**Figure 6.** Error taxonomy mapping English and SEA prediction outcomes to distinct model deficiencies.

Adjacent paper text (7 Conclusion): We introduce SEA-NLI, a culturally grounded NLI benchmark for evaluating encoder- and decoder-based models in Southeast Asian contexts. Our evaluation of 17 models shows that SEA-NLI remains challenging even for strong frontier and SEA-adapted models. Our analysis suggests that this degradation is drive…

**Figure 7.** Distribution of SEA-NLI samples across eight…

**Figure 8.** Comparison of Entailment Type Distributions…

**Figure 10.** The hypothesis generation prompt for the…

**Figure 11.** Evolution of the SEA-NLI word frequency distribution. All values are normalized to the percentage of…

**Figure 13.** Average character length of premises (top)…

**Figure 14.** Example of SEA-NLI in each regeneration step.

Plot text extracted alongside the caption: "Lexical Overlap by Improvement Step and Entailment Type" (x-axis: improvement step, from Step 1 (Unfiltered Set) through Step 4 to SEA-NLI; y-axis: average lexical overlap; series: Entailment, Neutral, Contradiction).

**Figure 16.** Inference speed vs. F1 score on the SEA-NLI benchmark. Results are shown for SEA performance (left)…

**Figure 17.** Distribution of data quality flags by cul…

**Figure 19.** The hypothesis generation prompt.

Adjacent paper text (I Cultural-aware Prompting): Building on the base prompt shown in Figure 20a, we employ culturally aware prompting to elicit the model’s latent SEA knowledge. This methodology integrates (i) target-culture metadata (Base+Cult.), (ii) target-culture-topic meta…

**Figure 20.** Evolution of prompt templates for culturally…

**Figure 21.** Error analysis of a cultural knowledge gap.

**Figure 22.** Average hypothesis length before and after…

**Figure 23.** Average hypothesis length before and after…
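Several of the figures above concern lexical overlap between premise and hypothesis, which the benchmark's generation prompt tries to keep low. The paper's exact overlap definition is not given on this page, so the following is only a sketch under an assumed definition: the fraction of distinct hypothesis tokens that also occur in the premise. The function names and the example sentences are hypothetical.

```python
import re

def tokens(text):
    """Lowercased set of word tokens (regex-based, an assumption)."""
    return set(re.findall(r"\w+", text.lower()))

def lexical_overlap(premise, hypothesis):
    """Share of distinct hypothesis tokens that also appear in the premise."""
    hyp = tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & tokens(premise)) / len(hyp)

# Hypothetical English pair, for illustration only.
premise = "Songkran is celebrated in April"
hypothesis = "Songkran happens in April"
print(lexical_overlap(premise, hypothesis))
```

Regex word splitting is only reasonable for languages written with spaces between words. Thai, for example, is written without them, so a language-specific tokenizer would be needed before a score like this means anything for the native-language versions.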

Original abstract

Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe a low performance from all models, especially for knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.
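To make the culture-aware prompting idea concrete, here is a minimal sketch of how a base NLI prompt might be extended with target-culture and topic metadata, as the paper's appendix excerpt describes for its Base+Cult. variant. The template wording, the `build_prompt` helper, and the example inputs are all hypothetical; the paper's actual templates (its Figure 20) are not reproduced on this page.

```python
# Hypothetical base template; the paper's real wording is not shown here.
BASE_TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Answer with one label: entailment, neutral, or contradiction."
)

def build_prompt(premise, hypothesis, culture=None, topic=None):
    """Return the base prompt, optionally prefixed with cultural metadata."""
    header = []
    if culture:
        header.append(f"Cultural context: {culture}.")
    if culture and topic:
        # Topic metadata is only added on top of culture metadata.
        header.append(f"Topic: {topic}.")
    body = BASE_TEMPLATE.format(premise=premise, hypothesis=hypothesis)
    return "\n".join(header + [body])

base = build_prompt("Songkran is celebrated in April", "The festival falls in spring")
with_culture = build_prompt(
    "Songkran is celebrated in April",
    "The festival falls in spring",
    culture="Thailand",
    topic="Festivals",
)
print(with_culture)
```

Keeping the base body identical across variants means any score difference between `base` and `with_culture` can be attributed to the added metadata lines rather than to a change in task wording.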

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SEA-NLI gives a new native multi-country NLI benchmark for Southeast Asia, but the claim that gaps come from missing cultural knowledge lacks controls that separate it from language modeling issues.


The main takeaway is that this paper builds SEA-NLI, the first NLI dataset constructed natively across eight SEA countries in both English and local languages, with native-speaker verification instead of machine translation from Western sources.

It does a solid job creating that resource and running it on 17 models. The results show low overall accuracy, with bigger drops on knowledge-intensive categories like Languages and Science and Technology. SEA-adapted models and culture-aware prompts give some lift while chain-of-thought does less.

The soft spot is the causal story. The abstract attributes failures mainly to absent SEA cultural knowledge, but the provided details do not include controls that would isolate culture from tokenization problems, pretraining gaps, or annotation artifacts in low-resource languages. No inter-annotator agreement numbers or comparisons to translated Western NLI items in the same languages appear in the summary, so the attribution stays underdetermined.

This work is aimed at researchers building or evaluating multilingual and culturally aware models, particularly those focused on Southeast Asia. The dataset construction itself is the part worth engaging with.

It deserves peer review. A new benchmark in an underrepresented region needs external checks on verification process and analysis strength, even if the current evidence for the knowledge-gap explanation is thin.

Referee Report

2 major / 1 minor

Summary. The paper introduces SEA-NLI, a native, culturally grounded NLI benchmark spanning eight Southeast Asian countries in English and regional languages, verified by native speakers. It evaluates 17 encoder and decoder models, reports low overall performance (especially in knowledge-intensive categories such as Languages and Science and Technology), and attributes failures primarily to missing SEA cultural knowledge, with supporting evidence from gains under SEA-adapted models and culture-aware prompting (while CoT yields limited benefit).

Significance. If the items are verifiably representative and the performance gaps are driven by absent cultural knowledge rather than confounds, the benchmark would be a useful diagnostic for culturally inclusive LLM evaluation. The empirical scope across multiple models and prompting conditions is a strength; the native construction and verification process also addresses a clear gap in existing Western-centric or translated NLI resources.

major comments (2)

[Abstract] The central claim that 'failure cases mainly stem from missing SEA cultural knowledge' is load-bearing for the paper's conclusions yet lacks isolating controls; no accuracy comparisons to translated Western NLI items in the same languages, culturally stripped variants, or quantitative measures of pretraining coverage/tokenization difficulty are reported, leaving the causal attribution underdetermined relative to language-modeling artifacts.
[Abstract] Native-speaker verification is asserted but without reported inter-annotator agreement statistics, annotator count, exclusion criteria, or agreement thresholds on cultural grounding, which are required to substantiate that the observed category-level drops reflect genuine cultural knowledge gaps rather than annotation artifacts.

minor comments (1)

[Abstract] The exact number of premise-hypothesis pairs, distribution across the eight countries, and breakdown by language (English vs. native) could be stated explicitly to allow readers to assess scale and balance.
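One of the isolating controls named in the major comments, culturally stripped variants, could be tallied as a paired comparison. The sketch below assumes each item has a model prediction on the original premise and on a stripped variant of the same item; the field names (`pred_original`, `pred_stripped`) and the toy rows are hypothetical, and the paper reports no such experiment on this page.

```python
def accuracy(rows, key):
    """Fraction of rows where the prediction under `key` matches gold."""
    return sum(r[key] == r["gold"] for r in rows) / len(rows)

def control_gap(rows):
    """Compare accuracy on original items vs. culturally stripped variants."""
    original = accuracy(rows, "pred_original")
    stripped = accuracy(rows, "pred_stripped")
    # Items the model gets wrong originally but right once stripped.
    recovered = sum(
        r["pred_original"] != r["gold"] and r["pred_stripped"] == r["gold"]
        for r in rows
    )
    return {
        "original": original,
        "stripped": stripped,
        "gap": stripped - original,
        "recovered": recovered,
    }

# Hypothetical paired predictions, for illustration only.
rows = [
    {"gold": "entailment", "pred_original": "neutral", "pred_stripped": "entailment"},
    {"gold": "neutral", "pred_original": "neutral", "pred_stripped": "neutral"},
    {"gold": "contradiction", "pred_original": "contradiction", "pred_stripped": "contradiction"},
    {"gold": "entailment", "pred_original": "neutral", "pred_stripped": "neutral"},
]
print(control_gap(rows))
```

A large positive gap in the same language would be consistent with the knowledge-gap explanation, while a small gap would point toward language-modeling difficulty. Neither outcome alone would settle the attribution.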

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and note planned revisions to the manuscript.


Referee: [Abstract] The central claim that 'failure cases mainly stem from missing SEA cultural knowledge' is load-bearing for the paper's conclusions yet lacks isolating controls; no accuracy comparisons to translated Western NLI items in the same languages, culturally stripped variants, or quantitative measures of pretraining coverage/tokenization difficulty are reported, leaving the causal attribution underdetermined relative to language-modeling artifacts.

Authors: We agree that direct isolating controls (e.g., translated Western NLI items or culturally stripped variants) would provide stronger causal evidence. Our current support for the claim rests on the observed performance improvements from SEA-adapted models and culture-aware prompting, contrasted with limited gains from CoT. We will revise the discussion and limitations sections to explicitly acknowledge this gap and clarify that tokenization effects are partially addressed by the English and native-language versions, though quantitative pretraining coverage analysis is not feasible without additional resources. No new experiments will be added.

revision: partial

Referee: [Abstract] Native-speaker verification is asserted but without reported inter-annotator agreement statistics, annotator count, exclusion criteria, or agreement thresholds on cultural grounding, which are required to substantiate that the observed category-level drops reflect genuine cultural knowledge gaps rather than annotation artifacts.

Authors: We will update the manuscript to report the full details of the native-speaker verification process, including inter-annotator agreement statistics, annotator counts, exclusion criteria, and agreement thresholds. These data were collected during benchmark construction and will be added to the methods and appendix sections.

revision: yes
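The rebuttal does not say which agreement statistic will be reported. As one common choice for two annotators labelling the same items, Cohen's kappa can be sketched as follows; using it here is an assumption, and the label sequences are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(
        count_a[label] * count_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical annotations, for illustration only.
annotator_1 = ["entailment", "entailment", "neutral", "contradiction"]
annotator_2 = ["entailment", "neutral", "neutral", "contradiction"]
print(cohens_kappa(annotator_1, annotator_2))
```

With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual alternative.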

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and model evaluation


The paper introduces a new native NLI dataset for SEA cultures, verifies items with native speakers, and reports empirical accuracies across 17 models plus prompt variants. No derivations, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. Claims about cultural knowledge gaps rest on observed performance differences and prompt improvements rather than any self-referential definition or self-citation chain. The work is self-contained against external benchmarks (model evaluations on the released dataset) and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption that NLI is a suitable probe for cultural knowledge but introduces no free parameters, no new mathematical entities, and no ad-hoc constants.

axioms (1)

domain assumption: Natural language inference can function as a lens into cultural understanding.
The entire experimental design treats NLI accuracy as a proxy for possession of Southeast Asian cultural knowledge.

pith-pipeline@v0.9.1-grok · 5691 in / 1140 out tokens · 24954 ms · 2026-06-28T10:17:41.132658+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 3 canonical work pages · 2 internal anchors

[1]

probable error

Flans at semeval-2026 task 7: Rag with open-sourced smaller llms for everyday knowledge across diverse languages and cultures. Preprint, arXiv:2603.01910.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empiri- ...
[2]

Gemma 3 Technical Report

Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria.

Joe Stacey, Lisa Alazraki, Aran Ubhi, Beyza Ermis, Aaron Mueller, and Marek Rei. 2026. Improving t...
[3]

Qwen3 Technical Report

Not all countries celebrate thanksgiving: On the cultural dominance in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6349–6384, Bangkok, Thailand.

Gijs Wijnholds and Michael Moortgat. 2021. SICK-NL: A dataset for Dutch natural language inference. In Procee...
[4]

LABEL CONSISTENCY (MANDATORY)
- Entailment MUST remain logically entailed by the premise.
- Neutral MUST remain neither entailed nor contradicted.
- Contradiction MUST remain logically incompatible with the premise

[5]

SAME UNDERLYING FACT
- All three hypotheses MUST refer to the SAME situation/context as the premise.
- Do NOT introduce unrelated events

[6]

ANTI-SHORTCUT DESIGN
- Avoid lexical overlap between premise and hypotheses.
- Do NOT reuse key nouns or verbs directly.
- Avoid negation tricks (e.g., "not", "never") as the main signal

[7]

REASONING COMPLEXITY
- Require multi-step reasoning (implicit facts, cultural inference, temporal or causal reasoning)

[8]

CULTURAL GROUNDING (SOUTHEAST ASIA)
- Must require specific cultural knowledge from: {culture}

[9]

LOGICAL PLAUSIBILITY
- All sentences must be realistic and natural

[10]

ONE SENTENCE ONLY
- Each premise and hypothesis must be EXACTLY one sentence

[11]

LANGUAGE CONSISTENCY
- Native = {culture} language
- English = fluent and natural
- Both versions must express the SAME meaning
[12]

idx": "{idx}

ENTRY ID PRESERVATION (CRITICAL) - You MUST return the EXACT same entry_id values - DO NOT modify or regenerate them --------------------- INPUT DATA --------------------- IDX: {idx} Premise Native: {premise_native} Premise English: {premise_english} Entailment Hypothesis Native: {hypothesis_native_entailment} Neutral Hypothesis Native: {hypothesis_native...
[13]

Cultural Relevance Score: Adapted from Cahyawijaya et al. (2025), this metric assesses how effectively the generated content aligns with the intended Southeast Asian (SEA) cultural context.
• Score 5 (Unique to SEA): The premise describes traditions, objects, or landmarks that originate in SEA and are considered iconic, such as Pad Thai, Batik, Songkran, ...
[14]

Cultural Understanding Score: This score quantifies the annotator’s personal familiarity with the specific cultural context of the sample.

Plot text extracted with this entry: x-axis Inference Speed (samples/sec); y-axis Weighted F1 Score; panel "Hard Set"; legend entries mbdeberta-base-snli-xnli, xlm-r-large-snli-xnli, mmbert-base-snli-xnli, Qwen3-VL-8B-Instruct, Qwen3.5-9B, Qwen-SEA-LION-v4-8B-...
[15]

Quality Score: This metric evaluates the linguistic clarity and contextual accuracy of the premise and its associated metadata.
• Score 5 (Excellent): The sentence is natural, grammatically perfect, and provides a clear cultural context.
• Score 4 (Good): The sentence is clear and usable, with only minor stylistic awkwardness.
• Score 3 (Fair): The sent...
[16]

Flagging Issues: Annotators identify specific qualitative concerns that may impact the reliability of the sample. These include:
• Linguistic Error: Significant grammar, spelling, or translation issues.
• Factual Inaccuracy: The premise contains incorrect information regarding the culture or location.
• Ambiguous Context: The statement is too vague to det...
[17]

Prohibit words: "might," "may," "also," "later," "should," "wish," "perhaps," "possibly," "likely."

[18]

Replacement strategy: Instead of saying "A person might do X," describe a specific action or state that is neither confirmed nor denied by the premise (e.g., "The person performs X"). The "Neutral" status must come from a lack of evidence in the premise, not from vague wording in the hypothesis.

you MUST revise the given premises to remove cultural knowl...
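The prohibited-word rule in the prompt fragments above lends itself to an automatic check on generated neutral hypotheses. The following is a minimal sketch: the word list is copied from the fragment, while the `prohibited_hits` helper and the whole-word, case-insensitive matching policy are assumptions rather than the authors' tooling.

```python
import re

# Word list as quoted in the prompt fragment above.
PROHIBITED = (
    "might", "may", "also", "later", "should",
    "wish", "perhaps", "possibly", "likely",
)

# Whole-word, case-insensitive match (a hypothetical matching policy).
_PATTERN = re.compile(r"\b(" + "|".join(PROHIBITED) + r")\b", re.IGNORECASE)

def prohibited_hits(hypothesis):
    """Return the prohibited words found in a hypothesis, in order."""
    return [m.group(1).lower() for m in _PATTERN.finditer(hypothesis)]

print(prohibited_hits("A person might do X"))
print(prohibited_hits("The person performs X"))
```

A check like this would also flag the month name "May", so any hits in English text would still need a manual look. It does not apply to the native-language versions, which would need their own word lists.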
