IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

Eryawan Presma Yulianrifat; Fajri Koto; Filbert Aurelian Tjiaranata; Ikhlasul Akmal Hanif; Muhammad Falensi Azmi

arxiv: 2606.01260 · v1 · pith:26EOHMJVnew · submitted 2026-05-31 · 💻 cs.CL · cs.AI

Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Fajri Koto

Pith reviewed 2026-06-28 17:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords IndoBias · LLM bias evaluation · Indonesian languages · cultural stereotypes · pretraining data · Javanese · Sundanese · Makasar

The pith

Existing LLMs exhibit strong bias towards prototypical sentences in Indonesian, while local languages suffer higher bias under the Ideology and Religion category.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IndoBias, a benchmark designed to measure bias in LLMs for Indonesian along with Javanese, Sundanese, and Makasar. It uses two evaluation tracks, one based on contrastive sentence pairs and the other on open generation scored against social science categories. Results indicate decoder models strongly favor prototypical Indonesian sentences and that local languages trigger more bias in ideology and religion topics. The work also links higher bias to Common Crawl pretraining data compared with human-reviewed sources, and notes that adding local languages to training tends to raise bias overall. Readers would care because Indonesia's hundreds of languages and ethnic groups make standard English-centric bias tests incomplete for fairness in this setting.

Core claim

Using the dual-track IndoBias benchmark, the authors report that decoder LLMs show a strong preference for prototypical sentences in Indonesian; that local languages incur higher measured bias in the Ideology and Religion category; that stereotype polarity varies non-uniformly across local entities; and that, in Indonesian, Common Crawl material introduces more bias during pretraining than Wikipedia or news texts, while including local languages in pretraining generally raises bias levels.

What carries the argument

The IndoBias benchmark, which pairs a depth-oriented contrastive-pair track with a breadth-oriented generation-based track; the latter is grounded in the SPI, O*NET, and WGI frameworks.
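
The contrastive-pair track can be pictured as a pairwise comparison. The sketch below is a minimal illustration, not the paper's scoring code: it assumes each pair has already been scored by a model (for example with a summed token log-probability) and counts how often the prototypical sentence receives the higher score. The names `ScoredPair` and `prototypical_win_rate` and the toy scores are hypothetical; the figure captions mention a "Protrope win rate", but its exact definition is not given on this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScoredPair:
    """One contrastive pair with hypothetical model scores (higher = more likely)."""
    prototypical_score: float       # score of the stereotype-consistent sentence
    anti_prototypical_score: float  # score of the contrasting sentence


def prototypical_win_rate(pairs: Sequence[ScoredPair]) -> float:
    """Fraction of pairs where the prototypical sentence scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one scored pair")
    wins = sum(p.prototypical_score > p.anti_prototypical_score for p in pairs)
    return wins / len(pairs)


# Toy scores, invented for illustration only.
toy_pairs = [
    ScoredPair(-12.1, -13.4),
    ScoredPair(-20.0, -19.2),
    ScoredPair(-8.7, -9.9),
    ScoredPair(-15.3, -15.9),
]
win_rate = prototypical_win_rate(toy_pairs)
print(win_rate)
```

Under this reading, a value near 0.5 would indicate no systematic preference, and values well above 0.5 a preference for the prototypical sentence.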

If this is right

Decoder models exhibit strong bias towards prototypical sentences in Indonesian.
Local languages suffer higher bias under the Ideology and Religion category.
LLM responses show non-uniform stereotype polarity when prompted with various local entities.
Common Crawl texts introduce more bias during pretraining than human-reviewed article texts.
Introducing local languages to pretraining generally increases bias.
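
One way to make "non-uniform stereotype polarity" concrete is to score each entity by the balance of positive and negative labels a model assigns to it, then look at the spread across entities. The sketch below does that under a loud assumption: the formula `(positive - negative) / (positive + negative)` is a stand-in chosen for illustration, since the paper's own Stereotype Polarity definition is not reproduced on this page. The group names and labels are invented.

```python
from collections import Counter
from typing import Dict, Iterable


def polarity_score(labels: Iterable[str], pos: str = "pos", neg: str = "neg") -> float:
    """Assumed polarity: +1 if all labels are positive, -1 if all are negative."""
    counts = Counter(labels)
    total = counts[pos] + counts[neg]
    if total == 0:
        raise ValueError("no positive or negative labels to score")
    return (counts[pos] - counts[neg]) / total


# Invented model outputs for three placeholder entities.
toy_outputs: Dict[str, list] = {
    "group_a": ["pos", "pos", "pos", "neg"],
    "group_b": ["neg", "neg", "pos", "neg"],
    "group_c": ["pos", "neg"],
}
scores = {group: polarity_score(labels) for group, labels in toy_outputs.items()}

# A wide gap between the highest and lowest score is one sign of non-uniformity.
spread = max(scores.values()) - min(scores.values())
print(scores, spread)
```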

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Curating pretraining corpora for Indonesian models may require explicit filtering of Common Crawl material to limit bias growth.
The dual-track design could be adapted to test bias in other multilingual regions with many low-resource languages.
If local-language inclusion reliably raises bias, targeted mitigation methods for those languages become a practical next step.
Adoption of IndoBias might encourage developers to test models on culturally specific prompts before deployment in Indonesia.

Load-bearing premise

The social science frameworks SPI, O*NET, and WGI provide culturally appropriate categories and polarity labels for measuring bias in Indonesian and local language contexts.

What would settle it

Measure bias scores on an otherwise identical model trained only on Wikipedia and news versus one trained on Common Crawl; lower bias on the human-reviewed corpus would support the pretraining-source claim.
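
The comparison described above could be analyzed along the following lines. This is a hedged sketch, not the paper's procedure: it assumes both models are scored on the same benchmark items, that each item yields a 1 (prototypical sentence preferred) or 0, and it uses a paired bootstrap to put a rough interval around the difference in win rates. The function name, the outcome encoding, and the toy data are all hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    wins_a: Sequence[int],
    wins_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed difference, lower bound, upper bound) for mean(a) - mean(b)."""
    if len(wins_a) != len(wins_b) or not wins_a:
        raise ValueError("need two equally long, non-empty outcome lists")
    n = len(wins_a)
    per_item = [a - b for a, b in zip(wins_a, wins_b)]
    observed = sum(per_item) / n

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    resampled = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each item's pair of outcomes together.
        resampled.append(sum(per_item[rng.randrange(n)] for _ in range(n)) / n)
    resampled.sort()

    lower = resampled[int(alpha / 2 * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Invented outcomes: model A trained on web-crawl text, model B on curated text.
crawl_wins = [1] * 70 + [0] * 30
curated_wins = [1] * 55 + [0] * 45
observed, lower, upper = paired_bootstrap_ci(crawl_wins, curated_wins)
print(observed, lower, upper)
```

An interval that stays above zero would be consistent with the pretraining-source claim for that pair of models; it would not by itself rule out other differences between the training runs.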

Figures

Figures reproduced from arXiv: 2606.01260 by Eryawan Presma Yulianrifat, Fajri Koto, Filbert Aurelian Tjiaranata, Ikhlasul Akmal Hanif, Muhammad Falensi Azmi.

**Figure 2.** IndoBias-Pairs construction pipeline. […]centric bias benchmarks, resulting in resources such as CrowS-Pairs (Nangia et al., 2020), StereoSet (Nadeem et al., 2021), and WinoBias (Zhao et al., 2018). However, stereotypes are not universal; they are shaped by local cultural, linguistic, and historical contexts, making it essential to develop evaluation datasets that reflect specific societies. Following thi…

**Figure 3.** IndoBias-QA construction pipeline.

| # | Task Name | Description |
|---|---|---|
| 1 | Simple Forced Choice | Choose the label that best describes a given group. |
| 2 | Incentivized Choice | Choose a label in a promotion-related role-play scenario. |
| 3 | Table Entry | Fill a dataset row with a group-label assignment. |
| 4 | Code Variable Assignment | Assign a label to a group-named variable in code. |
| 5 | Incentivized Dataset Entry | Fill a dataset row under an… |
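
The tasks listed with Figure 3 all ask a model to attach one of two labels to a group. As a rough sketch of how such a prompt might be built and its answer read back, the code below fills a template with `{group}`, `{pos}`, and `{neg}` placeholders (the placeholder names appear in the paper's prompt excerpts) and maps a free-text response to a polarity. The template wording, the function names, and the example labels are assumptions made for illustration, not the paper's actual templates or parser.

```python
from typing import Optional

# Hypothetical wording for a "Simple Forced Choice" style prompt.
FORCED_CHOICE_TEMPLATE = (
    "Which label best describes {group}? "
    'Answer with exactly one of "{pos}" or "{neg}".'
)


def build_prompt(template: str, group: str, pos: str, neg: str) -> str:
    """Fill the three placeholders used by the template."""
    return template.format(group=group, pos=pos, neg=neg)


def parse_choice(response: str, pos: str, neg: str) -> Optional[str]:
    """Return 'pos' or 'neg', or None when the response names both or neither label.

    Simple substring matching: it assumes neither label is contained in the other.
    """
    text = response.lower()
    has_pos = pos.lower() in text
    has_neg = neg.lower() in text
    if has_pos == has_neg:
        return None
    return "pos" if has_pos else "neg"


prompt = build_prompt(FORCED_CHOICE_TEMPLATE, "group X", "diligent", "lazy")
print(prompt)
print(parse_choice("Diligent.", "diligent", "lazy"))
print(parse_choice("I would rather not choose.", "diligent", "lazy"))
```

Treating refusals and two-label answers as `None` keeps them out of the polarity counts instead of forcing them into one side.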

**Figure 4.** SP scores of Qwen3-8B across (from left) Indonesian ethnicities, anonymized government institutions, …

**Figure 5.** Protrope win rate trends across training steps (normalized to the range [0, 1]).
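
The Figure 5 caption says the trends are normalized to [0, 1] but this page does not say how. A common choice, assumed here purely for illustration, is min-max scaling of each series across its checkpoints. The checkpoint values below are invented.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale a series so its minimum maps to 0.0 and its maximum to 1.0."""
    if not values:
        raise ValueError("need at least one value")
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A flat series has no trend to show; map every point to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Invented win rates at five successive training checkpoints.
toy_win_rates = [0.52, 0.58, 0.61, 0.55, 0.64]
normalized = min_max_normalize(toy_win_rates)
print(normalized)
```

Min-max scaling makes the shapes of different curves comparable, at the cost of hiding their absolute levels, so a normalized plot alone cannot show which setting is more biased overall.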

**Figure 6.** Protrope Win Rate across training steps for all domains and languages.

**Figure 7.** Interface of Google Form Privacy and Data Handling. Annotator identities will be fully anonymized; no personal identifiers such as names will be collected, stored, or reported in any publication or public release arising from this project. Annotators will be asked to provide limited demographic information, specifically their region of upbringing, solely for the purpose of internal demographic analysis …

Original abstract

Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied in Indonesia, thus leaving a critical gap in evaluating representational fairness and localized stereotypes within its uniquely vast, multilingual, and diverse sociocultural landscape. To address this, we introduce IndoBias as a culturally-grounded bias benchmark to assess LLMs bias in Indonesian and three local languages: Javanese, Sundanese, and Makasar. IndoBias features dual perspective evaluation tracks: depth-oriented (with contrastive-pairs) and breadth-oriented (with generation-based), where the latter is grounded in social science frameworks (SPI, O*NET, and WGI). Our results show that existing LLMs -- particularly decoder models -- exhibit strong bias towards prototypical sentences in Indonesian, while local languages suffer higher bias under Ideology and Religion category. We also find that LLMs responses exhibit a non-uniform Stereotype Polarity when prompted with various local entities. Finally, we discover that, in Indonesian, Common Crawl texts introduce more bias during pretraining, compared to human-reviewed article texts (e.g., Wikipedia, News), whereas introducing local languages to pretraining generally increases bias. This work highlights the importance of studying bias in culture-specific context. Warning: This paper contains example data that may be offensive, harmful, or biased.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

IndoBias gives a new benchmark and some concrete patterns on Indonesian bias, but the Western frameworks for its categories are a load-bearing assumption that needs checking.

The paper's real move is releasing IndoBias, a dual-track set for Indonesian plus Javanese, Sundanese, and Makasar. One track uses contrastive pairs, the other uses generation prompts tied to SPI, O*NET, and WGI categories. They report decoder models favoring prototypical Indonesian sentences, higher ideology/religion bias in the local languages, uneven polarity across entities, and that Common Crawl pretraining adds more bias than Wikipedia or news while adding local languages tends to raise bias overall.

That last set of findings is the part worth looking at. If the data collection and scoring are clean, the pretraining source comparison and the language-specific patterns are the kind of result that could actually shift how people think about multilingual training data.

The soft spot is the cultural fit. The benchmark claims to be grounded, yet the polarity and category labels come straight from instruments built in Western contexts. Nothing in the abstract or the stress-test note shows local validation, back-translation checks, or expert review for whether those labels match how stereotypes actually work in Javanese or Sundanese settings. If the assigned directions are off, the measured bias numbers stop being interpretable. That is not a small methodological detail for this kind of work.

Methods details are thin in the abstract, but the central claims rest on those label assignments, so any referee would need to see the exact construction process and any robustness checks.

This is for people who build or evaluate fairness tools outside English. The gap it targets is real, and the empirical patterns are specific enough to be worth referee time even if the framework question needs work.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces IndoBias, a dual-track benchmark for evaluating bias in LLMs for Indonesian and three local languages (Javanese, Sundanese, Makasar). The depth-oriented track uses contrastive sentence pairs; the breadth-oriented track uses generation tasks whose categories and polarities are assigned via the SPI, O*NET, and WGI frameworks. Reported results state that decoder models exhibit strong bias toward prototypical Indonesian sentences, local languages show elevated bias in the Ideology and Religion category, stereotype polarity is non-uniform across local entities, Common Crawl pretraining introduces more bias than Wikipedia or news text, and inclusion of local languages in pretraining generally increases bias.

Significance. If the cultural grounding of the polarity and category labels is valid, the benchmark would fill a documented gap in representational-fairness evaluation for a highly multilingual, low-resource language setting and would supply concrete evidence on the differential effects of pretraining corpora. The dual-track design and the explicit comparison of pretraining sources are concrete strengths that could support follow-on work on data curation for Indonesian-language models.

major comments (1)

[Abstract and breadth-oriented track description] The polarity and category assignments rest on direct transfer of SPI, O*NET, and WGI instruments without reported local-expert validation, back-translation checks, or alignment studies for Javanese/Sundanese/Makasar. Because these assignments are load-bearing for the claims of higher Ideology/Religion bias in local languages and non-uniform stereotype polarity, the absence of such validation renders the measured quantities difficult to interpret in the target cultural contexts.
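
An alignment study of the kind the referee asks for could be as simple as having local experts relabel a sample of items and measuring chance-corrected agreement with the transferred labels. The sketch below computes Cohen's kappa for two label lists. It is offered as one possible check, not something the paper reports, and the label lists are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement if each rater labeled independently at their own base rates.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented polarity labels for eight items.
transferred = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
local_expert = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]
kappa = cohens_kappa(transferred, local_expert)
print(kappa)
```

Low agreement on a given language or category would flag exactly the labels whose measured bias is hardest to interpret.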

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the cultural validity of the polarity and category assignments. We address the major comment below and outline planned revisions.

Referee: [Abstract and breadth-oriented track description] The polarity and category assignments rest on direct transfer of SPI, O*NET, and WGI instruments without reported local-expert validation, back-translation checks, or alignment studies for Javanese/Sundanese/Makasar. Because these assignments are load-bearing for the claims of higher Ideology/Religion bias in local languages and non-uniform stereotype polarity, the absence of such validation renders the measured quantities difficult to interpret in the target cultural contexts.

Authors: We agree that the direct transfer of SPI, O*NET, and WGI without language-specific validation for Javanese, Sundanese, and Makasar limits the strength of cultural claims in the breadth-oriented track. These frameworks were chosen for their established cross-national use in social science research, providing a consistent basis for category and polarity labeling where localized instruments are unavailable. Nevertheless, the absence of reported local-expert review or alignment checks is a genuine gap that affects interpretability of the Ideology/Religion bias elevation and non-uniform polarity results. In the revised manuscript we will (1) add an explicit limitations subsection in the breadth-oriented track description acknowledging the lack of back-translation and expert validation for the three local languages, (2) cite existing cross-cultural psychology literature on the transportability of these instruments, and (3) qualify the relevant result statements to reflect that the observed differences are measured under the transferred labeling scheme. We will not be able to conduct new expert validation studies within the current revision timeline but will flag this as a priority for follow-up work.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper constructs IndoBias as a new benchmark using external social science frameworks (SPI, O*NET, WGI) and reports empirical LLM evaluation results on Indonesian and local languages. No equations, fitted parameters renamed as predictions, self-citations bearing central claims, or self-definitional reductions appear in the derivation. The findings derive from direct model evaluations on the introduced dataset, remaining independent of any internal fitting or renaming loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that existing social science instruments can be directly applied to Indonesian cultural contexts and that contrastive and generation-based probes validly measure representational bias.

axioms (1)

domain assumption: Social science frameworks SPI, O*NET, and WGI are valid for capturing stereotypes in Indonesian and local language settings. Used to structure the breadth-oriented generation track.

