
arxiv: 2604.07119 · v1 · submitted 2026-04-08 · 💻 cs.CL

Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

Pith reviewed 2026-05-10 18:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords peer review bias · language of study · NLP · non-English papers · LOBSTER dataset · review fairness · bias detection

The pith

Non-English papers in NLP face substantially higher rates of language-of-study bias in peer reviews than English-only papers, with negative bias outweighing positive forms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether reviewers judge papers differently based on the languages they study rather than their scientific content. It introduces the LOBSTER dataset of human-annotated reviews and a detection method to identify this bias, then applies them to over 15,000 reviews from NLP venues. The analysis shows that papers focused on non-English languages receive more biased comments overall, and negative biases such as unjustified demands for cross-lingual generalization appear more frequently than positive ones. This finding matters because peer review is meant to assess merit, yet such patterns could systematically disadvantage research on diverse languages and limit the field's inclusivity. The authors release their resources to enable further study of fairer reviewing practices.

Core claim

Through the LOBSTER dataset and automated detection on 15,645 reviews, the work establishes that language-of-study bias occurs at higher rates for non-English papers than for English-only ones, negative bias consistently exceeds positive bias, and the dominant negative subcategory is demanding unjustified cross-lingual generalization. The authors distinguish negative and positive forms of this bias, provide the first systematic breakdown into four negative subcategories, and show that existing review guidelines have not eliminated the pattern.

What carries the argument

The LOBSTER human-annotated dataset of peer reviews together with a classifier that detects language-of-study bias at 87.37 macro F1, used to quantify bias rates and subcategories across thousands of reviews.
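
For readers unfamiliar with the metric, macro F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below computes it over the three classes named in the paper's confusion matrix (Negative Bias, Positive Bias, No Bias Detected). The short label strings, the function name, and the toy gold/predicted lists are illustrative assumptions, not data or code from the paper.

```python
# Hypothetical short names for the three classes in the paper's confusion matrix.
LABELS = ["negative", "positive", "none"]


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
gold = ["negative", "negative", "positive", "none", "none", "none"]
pred = ["negative", "none", "positive", "none", "none", "negative"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.4f}")
```

The paper reports its score on a 0 to 100 scale; the function above returns a fraction between 0 and 1.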

If this is right

  • Non-English papers encounter higher overall bias rates, with negative comments dominating.
  • Demanding unjustified cross-lingual generalization is the most common negative bias subcategory.
  • English-only papers show lower bias incidence, indicating the effect is tied to the languages studied.
  • Public release of LOBSTER and the detection method enables targeted interventions in reviewing.
  • Four distinct negative bias subcategories can now be tracked separately in future analyses.
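
The rate comparison behind these points can be sketched as a per-group tally: for each language-of-study group, the share of reviews labelled as negatively or positively biased. This is a minimal sketch under assumed inputs. The group names, label strings, and toy records are hypothetical, and the paper's own pipeline may differ.

```python
from collections import defaultdict


def bias_rates(records):
    """Compute negative and positive bias rates per language-of-study group.

    `records` is an iterable of (group, label) pairs, where label is one of
    "negative", "positive", or "none" (hypothetical encoding).
    """
    counts = defaultdict(lambda: {"negative": 0, "positive": 0, "total": 0})
    for group, label in records:
        counts[group]["total"] += 1
        if label in ("negative", "positive"):
            counts[group][label] += 1
    return {
        group: {
            "negative": c["negative"] / c["total"],
            "positive": c["positive"] / c["total"],
        }
        for group, c in counts.items()
    }


# Toy records, invented for illustration only.
records = [
    ("non-English", "negative"),
    ("non-English", "negative"),
    ("non-English", "positive"),
    ("non-English", "none"),
    ("English-only", "negative"),
    ("English-only", "none"),
    ("English-only", "none"),
    ("English-only", "none"),
]
rates = bias_rates(records)
for group, r in rates.items():
    print(group, r)
```

Raw rates like these do not control for confounders, which is the concern the referee report raises about the §5 comparison.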

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar language-of-study patterns may appear in peer review for other disciplines that value multilingual data.
  • Integrating bias-detection tools into review platforms could flag problematic comments before they reach authors.
  • Increasing linguistic diversity among reviewers might reduce the frequency of cross-lingual generalization demands.
  • Papers on low-resource or under-studied languages could experience even stronger effects if the bias scales with perceived language rarity.

Load-bearing premise

Human annotations for the LOBSTER dataset correctly isolate language-of-study bias from other aspects of review quality or reviewer style.

What would settle it

A fresh round of annotations on the same reviews by a new set of annotators that produces substantially lower rates of language-of-study bias labels for non-English papers, or a controlled trial sending identical paper content to reviewers with only the studied language varied.

Figures

Figures reproduced from arXiv: 2604.07119 by Abdulfattah Safa, Ehsan Barkhordar, Erika Lombart, Gözde Gül Şahin, Marie-Catherine de Marneffe, Verena Blaschke.

Figure 1: Negative and positive language-of-study bias
Figure 2: GEMINI-3.1-PRO-PREVIEW confusion matrix: Negative Bias, Positive Bias, No Bias Detected. ...
Figure 3: Bias rate (x-axis, %) by polarity across lan...
Figure 4: Distribution of negative bias subcategories (A–...
Figure 5: Top 20 non-English studied languages in the full analysis corpus (Table ...
Figure 6: Distribution of contribution types in the full analysis corpus (Table ...
Figure 7: Bias rate breakdowns by (a) venue year and (b) contribution type.
Figure 8: Cross-tabulation of studied language and contribution type, showing the number of papers at each ...
Figure 9: Decomposition of per-language negative and positive bias rates by paper contribution type.

Original abstract

Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that language-of-study (LoS) bias exists in NLP peer review, with non-English papers facing substantially higher rates of negative bias than English-only papers (negative bias outweighing positive). It introduces the human-annotated LOBSTER dataset, a detector achieving 87.37 macro F1, analyzes 15,645 reviews to quantify bias differentials by LoS, identifies four subcategories of negative bias (with unjustified cross-lingual generalization demands being dominant), and publicly releases all resources.

Significance. If the annotations validly isolate LoS bias, the work provides the first systematic large-scale evidence of this bias in NLP reviewing, which could inform guideline revisions and fairness interventions. The public release of LOBSTER and the detector is a clear strength that supports reproducibility and follow-on studies.

major comments (3)
  1. [§3] §3 (LOBSTER dataset construction): No inter-annotator agreement, annotation guidelines, or sampling strategy for the 15,645 reviews is reported. Without these, it is impossible to confirm that the labels isolate LoS bias from correlated signals such as overall review tone, perceived paper quality, or reviewer background, directly undermining the headline differential reported in §5.
  2. [§5] §5 (bias rate analysis): The comparison of negative/positive bias rates across LoS categories lacks any regression controls or matching for confounders (e.g., paper merit proxies, reviewer expertise, or review language). The reported 'substantially higher' rates for non-English papers therefore cannot be attributed solely to LoS.
  3. [§4] §4 (detection method): The 87.37 macro F1 is presented without ablation on the contribution of LoS-specific features versus general review-quality signals, leaving open whether the detector itself is capturing the intended bias or proxy variables.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 should explicitly define positive vs. negative LoS bias with an example sentence from a review.
  2. [§5] Table 1 (or equivalent) reporting subcategory frequencies should include confidence intervals or statistical tests for the dominance claim.
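
One way to attach the confidence intervals requested in the last minor comment is a Wilson score interval on each subcategory's share of negative-bias comments. The sketch below is an illustration under assumptions: the function name is hypothetical, the counts are invented, and the paper may choose a different interval or test.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Invented counts: 40 of 100 negative-bias comments fall in one subcategory.
low, high = wilson_interval(40, 100)
print(f"share = 0.40, 95% CI = ({low:.3f}, {high:.3f})")
```

If the interval for the leading subcategory does not overlap the intervals of the others, the dominance claim is on firmer footing, although a formal test would still be needed for a strict comparison.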

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback, which identifies key areas for improving the transparency and robustness of our work. We respond point-by-point to the major comments below, indicating where we will revise the manuscript.

point-by-point responses
  1. Referee: [§3] §3 (LOBSTER dataset construction): No inter-annotator agreement, annotation guidelines, or sampling strategy for the 15,645 reviews is reported. Without these, it is impossible to confirm that the labels isolate LoS bias from correlated signals such as overall review tone, perceived paper quality, or reviewer background, directly undermining the headline differential reported in §5.

    Authors: We agree that explicit reporting of these elements is necessary to substantiate that the LOBSTER annotations isolate LoS bias. The submitted version described the annotation task at a high level but omitted the inter-annotator agreement statistics, the complete guidelines, and the precise sampling procedure. In the revised manuscript we will add Cohen's kappa for IAA, reproduce the full annotation guidelines in an appendix, and detail the stratified sampling from the 15,645 reviews. These additions will allow readers to evaluate whether the labels target LoS-specific signals rather than general negativity or quality. revision: yes

  2. Referee: [§5] §5 (bias rate analysis): The comparison of negative/positive bias rates across LoS categories lacks any regression controls or matching for confounders (e.g., paper merit proxies, reviewer expertise, or review language). The reported 'substantially higher' rates for non-English papers therefore cannot be attributed solely to LoS.

    Authors: The referee is correct that the §5 results are presented as raw rate comparisons without regression or matching. While the magnitude and consistency of the differentials across LoS categories provide suggestive evidence of bias, we cannot claim sole attribution without controls. In revision we will add logistic regression models that incorporate available proxies (review length, sentiment polarity, and review language) and will explicitly discuss the absence of reviewer-expertise metadata as a limitation. This will qualify the headline claim while preserving the observed pattern. revision: partial

  3. Referee: [§4] §4 (detection method): The 87.37 macro F1 is presented without ablation on the contribution of LoS-specific features versus general review-quality signals, leaving open whether the detector itself is capturing the intended bias or proxy variables.

    Authors: We acknowledge that the absence of feature ablations leaves the source of the detector's performance ambiguous. The model was trained on LoS-bias annotations, yet we did not isolate the contribution of language-related features. In the revised paper we will include an ablation study that removes LoS-specific lexical and syntactic features and reports the resulting drop in macro F1, thereby demonstrating that performance is not driven solely by general review-quality signals. revision: yes
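
The first response promises Cohen's kappa as the inter-annotator agreement statistic. As a reminder of what that number measures, the sketch below computes kappa for two annotators as observed agreement corrected for the agreement expected by chance. The annotator label lists are invented for illustration, and the label strings are a hypothetical encoding of the paper's three classes.

```python
from collections import Counter


def cohens_kappa(ann_a, ann_b):
    """Return Cohen's kappa for two annotators labelling the same items."""
    if len(ann_a) != len(ann_b) or not ann_a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(ann_a)
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    count_a, count_b = Counter(ann_a), Counter(ann_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration only.
annotator_a = ["negative"] * 3 + ["positive"] * 2 + ["none"] * 5
annotator_b = [
    "negative", "negative", "none", "positive", "positive",
    "none", "none", "none", "none", "negative",
]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 items, but because chance agreement is 0.38, kappa comes out lower than the raw agreement rate.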

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential reductions

full rationale

The paper performs an empirical analysis of peer reviews, creating the LOBSTER dataset via human annotation and applying statistical methods to 15,645 reviews to quantify language-of-study bias. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. Central claims rest on observed annotation outcomes and bias rate differentials rather than reducing to definitional inputs or prior self-referential results by construction. The work is self-contained against external benchmarks of review data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on the assumption that bias can be objectively annotated from reviews and that the analyzed reviews are representative of NLP peer review practices. No free parameters or invented entities are described.

axioms (1)
  • domain assumption: Human annotators can reliably detect language-of-study bias in peer review text.
    The entire analysis and detection method depend on the quality and consistency of these human annotations for the LOBSTER dataset.

pith-pipeline@v0.9.0 · 5554 in / 1277 out tokens · 92455 ms · 2026-05-10T18:29:08.459069+00:00 · methodology
