arxiv: 2509.13400 · v6 · submitted 2025-09-16 · 💻 cs.CY · cs.AI

Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

Pith reviewed 2026-05-18 15:36 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords LLM bias · peer review · affiliation bias · AI fairness · academic evaluation · controlled interventions · token-level ratings

The pith

Large language models used for peer reviews consistently favor authors from highly ranked institutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs introduce bias when they assist with or fully generate peer reviews, by deliberately changing details about the authors in the input prompt while leaving the paper itself untouched. It finds that models give more favorable evaluations to authors from top-ranked universities, show some preference for more senior researchers or those with strong publication records, and display smaller but detectable effects related to gender. These patterns stand out more clearly in the model's internal probability scores for rating tokens than in the final written text. The work matters because LLMs are already being used to assist or automate parts of the review process, which could systematically shape whose research advances.

Core claim

Through controlled interventions that modify author affiliation, gender, seniority, and publication history in prompts to various LLMs while keeping the paper content the same, the study demonstrates a consistent affiliation bias favoring highly ranked institutions, directional preferences linked to seniority and prior publications that affect borderline acceptance decisions, smaller gender effects, and more pronounced implicit biases visible in token-level soft ratings.

What carries the argument

Controlled interventions that alter specific pieces of author metadata (affiliation, gender, seniority, publication record) within the prompt given to the LLM while keeping the paper content fixed.
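A minimal sketch of such an intervention is given below. The prompt template, the field names, and the 1 to 10 rating scale are hypothetical and chosen only for illustration; the paper's own standardized prompt (its Figure 3) is not reproduced on this page. The point is that every variant shares the same paper text and differs in exactly one metadata field.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AuthorProfile:
    # Hypothetical metadata fields, loosely mirroring the attributes
    # the paper varies (affiliation, seniority, publication history).
    name: str
    affiliation: str
    seniority: str
    publications: int


# Hypothetical template; NOT the paper's actual review prompt.
TEMPLATE = (
    "Author: {name}\n"
    "Affiliation: {affiliation}\n"
    "Position: {seniority}\n"
    "Prior publications: {publications}\n"
    "\n"
    "Paper:\n{paper}\n"
    "\n"
    "Rate this paper from 1 to 10."
)


def build_prompt(profile: AuthorProfile, paper_text: str) -> str:
    """Fill the template with one author profile and the fixed paper text."""
    return TEMPLATE.format(
        name=profile.name,
        affiliation=profile.affiliation,
        seniority=profile.seniority,
        publications=profile.publications,
        paper=paper_text,
    )


def counterfactual_prompts(base, paper_text, field, values):
    """Return {value: prompt}, changing only `field` of the base profile."""
    return {
        value: build_prompt(replace(base, **{field: value}), paper_text)
        for value in values
    }
```

For example, `counterfactual_prompts(base, paper, "affiliation", ["University X", "University Y"])` yields two prompts that are identical except for the affiliation line, so any systematic difference in the model's ratings can be attributed to that field, subject to the load-bearing premise discussed further down.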

If this is right

  • Papers from less prestigious institutions receive lower evaluations even when the content is identical to those from top places.
  • Seniority and publication history can shift outcomes for papers near the acceptance threshold.
  • Gender effects appear in several models though they are smaller and less consistent than affiliation bias.
  • Biases that remain hidden in the final review text become visible when examining the model's soft probability scores for rating tokens.
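The last point can be illustrated with a small sketch. This page does not give the paper's exact definition of a soft rating, so the code assumes one plausible reading: the expected rating under the model's probability distribution over rating tokens, renormalized over the tokens that parse as integers. The function names and the input format (a mapping from token string to log-probability) are hypothetical.

```python
import math


def soft_rating(logprobs):
    """Expected rating over rating tokens (assumed definition, see lead-in).

    `logprobs` maps token strings such as "6" to log-probabilities.
    Tokens that are not plain integers are ignored, and the remaining
    probabilities are renormalized to sum to one.
    """
    ratings = {int(t): lp for t, lp in logprobs.items() if t.strip().isdigit()}
    if not ratings:
        raise ValueError("no rating tokens found")
    top = max(ratings.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {r: math.exp(lp - top) for r, lp in ratings.items()}
    total = sum(weights.values())
    return sum(r * w for r, w in weights.items()) / total


def hard_rating(logprobs):
    """Most likely rating token, i.e. what greedy decoding would emit."""
    ratings = {int(t): lp for t, lp in logprobs.items() if t.strip().isdigit()}
    if not ratings:
        raise ValueError("no rating tokens found")
    return max(ratings, key=ratings.get)
```

With made-up distributions `{"5": log 0.3, "6": log 0.5, "7": log 0.2}` and `{"5": log 0.2, "6": log 0.5, "7": log 0.3}`, both conditions produce the same written rating of 6, while the soft ratings differ (5.9 versus 6.1). That is the sense in which a preference can be invisible in the generated text yet present in the probabilities.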

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same metadata changes could produce similar biases in other LLM-supported evaluation tasks such as grant reviewing or hiring decisions.
  • Explicit instructions to ignore author information might reduce but not fully remove the observed preferences if they stem from training data patterns.
  • Systems that rely on probability outputs rather than generated text may need extra safeguards to limit these effects.
  • Testing whether the bias strength varies across different LLMs would clarify how much model choice influences fairness.

Load-bearing premise

Altering author metadata in the prompt isolates the causal effect of that attribute on the model's judgment without the model detecting the manipulation or responding to other unmeasured prompt features.

What would settle it

Running the identical paper through the same LLM multiple times with only the affiliation changed and finding no consistent difference in review scores or text across different institution rankings.
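One way to run this check is sketched below, assuming each paper has been scored once under two conditions (for example, a top-ranked and a lower-ranked affiliation). The higher/lower/tie breakdown mirrors the percentages reported in the paper's bar charts; the exact two-sided sign test is an illustrative choice of paired test, not necessarily the one the authors use.

```python
from collections import Counter
from math import comb


def outcome_rates(scores_a, scores_b):
    """Percentage of papers where condition A scores higher, lower, or ties."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    counts = Counter()
    for a, b in zip(scores_a, scores_b):
        counts["higher" if a > b else "lower" if a < b else "tie"] += 1
    n = len(scores_a)
    return {k: 100.0 * counts[k] / n for k in ("higher", "lower", "tie")}


def sign_test_p(scores_a, scores_b):
    """Exact two-sided sign test on paired scores; ties are discarded."""
    pos = sum(a > b for a, b in zip(scores_a, scores_b))
    neg = sum(a < b for a, b in zip(scores_a, scores_b))
    n = pos + neg
    if n == 0:
        return 1.0  # every pair tied: no evidence of a direction
    k = min(pos, neg)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

If the claim were wrong, the higher and lower percentages would be roughly balanced and the p-value would be large across institution pairs and repeated runs. A consistently lopsided split in the same direction is what the paper reports.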

Figures

Figures reproduced from arXiv: 2509.13400 by Hui-Po Wang, Ivaxi Sheth, Mario Fritz, Ruta Binkyte, Sai Suresh Macharla Vasu.

Figure 1: Publication history bias. % of papers where the LLM assigns a higher rating to the author shown with 100 TTP compared to 0 TTP.
Extracted plot labels: models (Ministral 8B, DeepSeek Llama-8B, Llama3.1 8B, Mistral 22B, DeepSeek Qwen-32B, QwQ 32B, Llama3.1 70B, Gemini2 Flash Lite, GPT-4o Mini); axis: Percentage (%), 0 to 100; legend: PI > UG, UG > PI, Tie.
Figure 2: Seniority bias. % of papers where the LLM assigns a higher rating to a Senior PI profile compared to an Undergraduate Student.
Figure 3: Standardized review prompt used in all LLM experiments.
Figure 4: Affiliation bias heatmaps for all evaluated models, ordered by model size. Each cell…

Original abstract

The adoption of large language models (LLMs) is transforming the peer review process, from assisting reviewers in writing detailed evaluations to generating entire reviews automatically. While these capabilities offer new opportunities, they also raise concerns about fairness and reliability. In this paper, we investigate bias in LLM-generated peer reviews through controlled interventions on author metadata, including affiliation, gender, seniority, and publication history. Our analysis consistently shows a strong affiliation bias favoring authors from highly ranked institutions. We also identify directional preferences associated with seniority and prior publication record, which can influence acceptance decisions for borderline papers. Gender effects are smaller but present in several models. Notably, implicit biases become more pronounced when examining token-level soft ratings, suggesting that alignment may mask but not fully eliminate underlying preferences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates bias in LLM-generated peer reviews via controlled metadata interventions on affiliation, gender, seniority, and publication history. It reports a strong affiliation bias favoring highly ranked institutions, directional effects from seniority and prior publications on acceptance decisions, smaller gender effects, and that implicit biases appear more clearly in token-level soft ratings than in final outputs.

Significance. If substantiated with full methodological details, the work would contribute to understanding fairness risks when LLMs assist or automate peer review, a timely topic in AI ethics and scholarly publishing. The empirical intervention design is a reasonable approach for isolating attribute effects, though the current evidence base is thin.

major comments (2)
  1. [Methods] Methods section: the abstract and visible description supply no sample sizes, exact LLM models, prompt templates, statistical tests, or controls for prompt length/content. These omissions are load-bearing for claims of 'consistent' and 'strong' affiliation bias and directional preferences.
  2. [Results] Results and intervention description: the central assumption that metadata swaps cleanly isolate causal effects is not tested or discussed. If full paper text (including self-references or institution mentions) is retained in the prompt, LLMs may respond to detectable inconsistencies rather than the target attribute, confounding token-level soft ratings and acceptance outcomes.
minor comments (2)
  1. [Abstract] Abstract: 'several models' is mentioned without naming them or providing version details.
  2. [Results] Notation: 'token-level soft ratings' would benefit from a precise definition or example in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which highlight important areas for improving the clarity and robustness of our work on biases in LLM-assisted peer review. We respond to each major comment below and indicate the revisions we plan to make.

point-by-point responses
  1. Referee: [Methods] Methods section: the abstract and visible description supply no sample sizes, exact LLM models, prompt templates, statistical tests, or controls for prompt length/content. These omissions are load-bearing for claims of 'consistent' and 'strong' affiliation bias and directional preferences.

    Authors: We agree that explicit methodological details are necessary to support our claims of consistent and strong biases. The full manuscript's Methods section does describe the overall experimental design, but we acknowledge that key specifics such as exact sample sizes, LLM model names, full prompt templates, and statistical procedures could be presented more prominently. We will revise the manuscript to include these details directly in the main text or a dedicated methods summary, add the prompt templates to an appendix, specify the statistical tests (e.g., paired comparisons and regression models), and clarify controls for prompt length and content standardization. These changes will be incorporated in the revised version. revision: yes

  2. Referee: [Results] Results and intervention description: the central assumption that metadata swaps cleanly isolate causal effects is not tested or discussed. If full paper text (including self-references or institution mentions) is retained in the prompt, LLMs may respond to detectable inconsistencies rather than the target attribute, confounding token-level soft ratings and acceptance outcomes.

    Authors: This is a substantive methodological point that merits explicit treatment. Our design kept the core paper content fixed while swapping only the targeted metadata fields, but we did not include a dedicated test or discussion of whether LLMs might detect inconsistencies (e.g., via self-references or institution mentions). We will add a paragraph in the Methods section and a corresponding limitations discussion that addresses this assumption, describes any steps taken to minimize detectable artifacts (such as using standardized, anonymized paper excerpts), and notes the potential for residual confounding. If space permits, we will also report a brief sensitivity check. This revision will be made. revision: yes
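The sensitivity check mentioned in the second response could start with something as simple as the sketch below, which flags papers whose body text still mentions an institution or author name that conflicts with the swapped metadata. This is an illustrative assumption about what such a check might look like, not a procedure described by the authors; the function name and the term list are hypothetical.

```python
import re


def find_metadata_leaks(paper_text, terms):
    """Return the terms that occur in `paper_text` as whole words or phrases.

    Matching is case-insensitive. The lookarounds prevent a short name
    from matching inside a longer word (e.g. "MIT" inside "limitations").
    """
    leaks = []
    for term in terms:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, paper_text, flags=re.IGNORECASE):
            leaks.append(term)
    return leaks
```

Papers for which this returns a non-empty list could be excluded or anonymized before the intervention, and the bias estimates recomputed on the remaining set to see whether the effect survives. A plain string match will miss paraphrased or abbreviated mentions, so it gives a lower bound on leakage rather than a guarantee of clean inputs.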

Circularity Check

0 steps flagged

No circularity in empirical intervention study

full rationale

The paper reports results from controlled metadata interventions on LLM prompts for peer-review simulation, with claims based on observed differences in generated reviews, token-level ratings, and acceptance decisions. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citation chains appear in the abstract or described methodology. Central findings are presented as direct experimental observations rather than derivations that reduce to inputs by construction, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Claims rest on the premise that metadata interventions cleanly surface model-internal preferences rather than prompt artifacts or surface-level pattern matching.

axioms (1)
  • domain assumption LLM outputs after metadata swaps reflect stable internal biases rather than sensitivity to prompt phrasing or detection of the experimental manipulation.
    Invoked when interpreting rating differences as evidence of hidden bias (abstract description of controlled interventions).

pith-pipeline@v0.9.0 · 5669 in / 1132 out tokens · 36905 ms · 2026-05-18T15:36:44.685217+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PeerPrism: Peer Evaluation Expertise vs Review-writing AI

    cs.CL 2026-04 unverdicted novelty 7.0

    PeerPrism benchmark demonstrates that state-of-the-art LLM detectors conflate surface text style with intellectual contribution and fail on hybrid human-AI peer reviews.

  2. Inspectable AI for Science: A Research Object Approach to Generative AI Governance

    cs.AI 2026-04 conditional novelty 5.0

    Generative AI use in science can be governed through structured documentation and provenance capture by framing AI interactions as inspectable Research Objects rather than debating authorship.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 2 Pith papers · 1 internal anchor

  3. [3]

    AAAI. 2025. https://aaai.org/aaai-launches-ai-powered-peer-review-assessment-system/ AAAI launches AI-powered peer review assessment system. Web page. Accessed: 2025-07-29

  4. [4]

    Jiafu An, Difang Huang, Chen Lin, and Mingzhu Tai. 2025. https://doi.org/10.1093/pnasnexus/pgaf089 Measuring gender and racial biases in large language models: Intersectional evidence from automated resume evaluation

  5. [5]

    Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L Griffiths. 2025. Explicitly unbiased large language models still form biased associations. Proceedings of the National Academy of Sciences, 122(8):e2416228122

  6. [6]

    Alina Beygelzimer, Yann N Dauphin, Percy Liang, and Jennifer Wortman Vaughan. 2023. Has the machine learning review process become more arbitrary as the field has grown? the neurips 2021 consistency experiment. arXiv preprint arXiv:2306.03262

  7. [7]

    CSRankings.org. 2025. https://csrankings.org/ CSRankings: Computer Science Rankings. Web page. Accessed: 2025-07-28, metrics-based ranking of CS institutions

  8. [8]

    Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. 2024. Bias and unfairness in information retrieval systems: New challenges in the llm era. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6437--6447

  9. [9]

    Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097--1179

  10. [10]

    ICLR. 2025. https://blog.iclr.cc/2025/04/15/leveraging-llm-feedback-to-enhance-review-quality/ Leveraging LLM feedback to enhance review quality. Web page. Accessed: 2025-07-29

  11. [11]

    Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, and 1 others. 2024. Monitoring ai-modified content at scale: A case study on the impact of chatgpt on ai conference peer reviews. arXiv preprint arXiv:2403.07183

  12. [12]

    Mathias Wullum Nielsen, Christine Friis Baker, Emer Brady, Michael Bang Petersen, and Jens Peter Andersen. 2021. Weak evidence of country-and institution-related status bias in the peer review of abstracts. Elife, 10:e64561

  13. [13]

    OpenAI. 2025. https://openai.com/index/introducing-deep-research/ Introducing deep research . Accessed: 2025-07-28

  14. [14]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

  15. [15]

    Pat Pataranutaporn, Nattavudh Powdthavee, and Pattie Maes. 2025. Can ai solve the peer review crisis? a large scale experiment on llm's performance and biases in evaluating economics papers. arXiv preprint arXiv:2502.00070

  16. [16]

    QS. 2025. https://www.topuniversities.com/world-university-rankings QS world university rankings 2026. Web page. Accessed: 2025-07-28, covers methodology and ranking details

  17. [17]

    Hyungyu Shin, Jingyu Tang, Yoonjoo Lee, Nayoung Kim, Hyunseung Lim, Ji Yong Cho, Hwajung Hong, Moontae Lee, and Juho Kim. 2025. Mind the blind spots: A focus-level evaluation framework for llm reviews. arXiv preprint arXiv:2502.17086

  18. [18]

    Times Higher Education. 2024. https://www.timeshighereducation.com/world-university-rankings/world-university-rankings-2025-methodology World university rankings 2025. Report and methodology guide. Published Sep 23, 2024; accessed 2025-07-28

  19. [19]

    U.S. News & World Report. 2025. https://www.usnews.com/education/best-global-universities Best global universities rankings 2025. Web page. Accessed: 2025-07-28

  20. [20]

    Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, and Nanyun Peng. 2023. "Kelly is a warm person, Joseph is a role model": Gender biases in LLM-generated reference letters. arXiv preprint arXiv:2310.09219

  21. [21]

    Rui Ye, Xianghe Pang, Jingyi Chai, Jiaao Chen, Zhenfei Yin, Zhen Xiang, Xiaowen Dong, Jing Shao, and Siheng Chen. 2024. Are we there yet? revealing the risks of utilizing large language models in scholarly peer review. arXiv preprint arXiv:2412.01708

  22. [22]

    Yaohui Zhang, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, and Weixin Liang. 2025. From replication to redesign: Exploring pairwise comparisons for llm-based peer review. arXiv preprint arXiv:2506.11343

  23. [24]

    Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. 2025b. Deepreview: Improving llm-based paper review with human-like deep thinking process. arXiv preprint arXiv:2503.08569