LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

Changjia Zhu; Chen Chen; Junjie Xiong; Junyu Wang; Lingyao Li; Renkai Ma; Runlong Yu; Zhicong Lu

arXiv: 2605.25415 · v1 · pith:33RFBHL5new · submitted 2026-05-25 · cs.CL · cs.CY · cs.ET

Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu

Pith reviewed 2026-06-29 22:34 UTC · model grok-4.3

classification: cs.CL · cs.CY · cs.ET

keywords: LLM reviewers · peer review · prompt injection · rating calibration · human-AI divergence · adversarial attacks · academic publishing · NeurIPS ICLR

The pith

LLMs used as paper reviewers overrate weak submissions, diverge from human emphasis on clarity versus reproducibility, and are readily manipulated by hidden prompt injections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests twelve large language models on 898 papers drawn from NeurIPS and ICLR to measure how well they rate submissions, how their judgments line up with human reviewers, and how easily they can be tricked. The models give higher scores to lower-quality work, flag different aspects than humans do, and write longer but more repetitive reviews. Simple hidden instructions inserted through an invisible font trick can push low-scoring papers up to acceptance-level ratings in many cases. These findings matter for anyone considering LLMs as assistants or replacements in academic peer review.

Core claim

LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families.
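The length and lexical-diversity comparison in this claim can be illustrated with a small sketch. It computes word count, a type-token ratio, and the Flesch–Kincaid and Gunning Fog indices named in the Figure 2 excerpt. The function name `style_profile`, the vowel-group syllable heuristic, and the use of type-token ratio as the lexical-diversity measure are assumptions made for illustration; this page does not say how the paper measures them.

```python
import re


def _words(text):
    # Alphabetic tokens only; numbers and punctuation are ignored.
    return re.findall(r"[A-Za-z']+", text)


def _sentences(text):
    # Split on sentence-ending punctuation and drop empty pieces.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def _syllables(word):
    # Crude heuristic (assumption): one syllable per group of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def style_profile(text):
    """Return simple length, diversity, and readability measures for a review."""
    words = _words(text)
    if not words:
        raise ValueError("text contains no words")
    n_words = len(words)
    n_sentences = max(1, len(_sentences(text)))
    n_syllables = sum(_syllables(w) for w in words)
    n_complex = sum(1 for w in words if _syllables(w) >= 3)
    return {
        "word_count": n_words,
        # Distinct words divided by total words; lower means more repetition.
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Standard Flesch-Kincaid Grade Level formula.
        "fk_grade": 0.39 * n_words / n_sentences
        + 11.8 * n_syllables / n_words
        - 15.59,
        # Standard Gunning Fog formula; "complex" means three or more syllables.
        "gunning_fog": 0.4 * (n_words / n_sentences + 100 * n_complex / n_words),
    }
```

Type-token ratio falls as texts get longer, so comparing it directly between a short human review and a review two to three times longer would need length normalization, for example by scoring fixed-size windows.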

What carries the argument

Three-axis benchmark of rating calibration against human scores, divergence in flagged criteria, and resistance to invisible font-mapping prompt injection, applied to 898 stratified conference papers.
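For the calibration axis, a minimal sketch of the comparison is given below, assuming the gap is the mean difference between a model's rating and the human rating over the papers both have scored. The exact aggregation used in the paper is not stated on this page, and the function names and the `threshold` split are hypothetical.

```python
from statistics import mean


def paper_level_gaps(llm_scores, human_scores):
    """Map each shared paper id to (model rating - human rating)."""
    shared = llm_scores.keys() & human_scores.keys()
    return {p: llm_scores[p] - human_scores[p] for p in shared}


def aggregate_gap(llm_scores, human_scores):
    """Mean paper-level gap; positive values mean the model rates higher."""
    return mean(paper_level_gaps(llm_scores, human_scores).values())


def gap_by_human_tier(llm_scores, human_scores, threshold):
    """Mean gap for papers humans scored below vs. at or above `threshold`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    gaps = paper_level_gaps(llm_scores, human_scores)
    low = [g for p, g in gaps.items() if human_scores[p] < threshold]
    high = [g for p, g in gaps.items() if human_scores[p] >= threshold]
    return (mean(low) if low else None, mean(high) if high else None)
```

A larger positive gap in the low tier than in the high tier is the pattern the summary describes as overrating weaker submissions.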

If this is right

LLMs can structure evaluations but require safeguards against intrinsic rating biases.
Integration of LLMs into peer review demands protection against adversarial prompt attacks.
Prompt injection success rates differ markedly across model families.
LLM reviews emphasize reproducibility more and clarity less than human reviews do.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Review platforms using LLMs would need input sanitization to block hidden instructions.
Widespread LLM use could shift which papers receive high scores if the overrating pattern holds.
The more uniform vocabulary in LLM reviews might reduce the variety of feedback authors receive.
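One way to picture the input sanitization mentioned above: compare the text a parser extracts from the PDF with the text a reader would actually see. The sketch assumes the visible text has already been obtained by some external step such as OCR of the rendered pages; both inputs are plain strings here. It is an illustrative check, not a defense proposed in the paper, and all names are hypothetical.

```python
import difflib
import re


def _tokens(text):
    # Lowercased alphanumeric tokens, so layout and punctuation are ignored.
    return re.findall(r"[a-z0-9]+", text.lower())


def text_layer_agreement(extracted_text, visible_text):
    """Similarity in [0, 1] between the machine-readable and visible text."""
    a, b = _tokens(extracted_text), _tokens(visible_text)
    if not a and not b:
        return 1.0
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def hidden_spans(extracted_text, visible_text, min_tokens=4):
    """Token runs present in the extracted text but not in the visible text."""
    a, b = _tokens(extracted_text), _tokens(visible_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean the parser saw tokens a reader did not.
        if tag in ("delete", "replace") and i2 - i1 >= min_tokens:
            spans.append(" ".join(a[i1:i2]))
    return spans
```

A platform could route any submission with non-empty `hidden_spans` output to a human check before the text reaches an LLM reviewer. OCR noise will produce some mismatches on clean papers, so `min_tokens` would need tuning.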

Load-bearing premise

Human reviewer ratings on the 898 papers form a reliable ground truth for measuring LLM overrating and divergence, and the font-mapping attack is a realistic adversarial threat.

What would settle it

A follow-up test on a fresh set of human-rated papers in which the same LLMs show no systematic overrating of weak submissions and resist the hidden instructions would falsify the central claims.
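The injection half of such a follow-up test reduces to a promotion rate: among papers a model scores below the acceptance level without the hidden instructions, what fraction does it score at or above that level with them. The sketch below assumes per-paper scores from a clean run and an injected run; `accept_threshold` is a placeholder, since the acceptance cut-off used in the paper is not given on this page.

```python
def promotion_rate(clean_scores, injected_scores, accept_threshold):
    """Fraction of originally low-scoring papers promoted by the injection.

    clean_scores / injected_scores: dicts mapping paper id -> model rating.
    accept_threshold: hypothetical acceptance-level rating.
    Returns None when no paper is eligible.
    """
    eligible = [
        p
        for p, score in clean_scores.items()
        if score < accept_threshold and p in injected_scores
    ]
    if not eligible:
        return None
    promoted = sum(1 for p in eligible if injected_scores[p] >= accept_threshold)
    return promoted / len(eligible)
```

A rate near zero on fresh papers would count against the vulnerability claim; computing it per model would show the variation across model families that the paper reports.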

Figures

Figures reproduced from arXiv: 2605.25415 by Changjia Zhu, Chen Chen, Junjie Xiong, Junyu Wang, Lingyao Li, Renkai Ma, Runlong Yu, Zhicong Lu.

**Figure 2.** Aggregate calibration gaps between LLMs and human reviewers. Each cell reports $\Delta r^M$ or $\Delta c^M$. Writing Style Measures. We compare human and LLM reviews using five metrics: word count (length), Flesch–Kincaid (FK) Grade Level (Kincaid et al., 1975) (syntactic readability), Gunning Fog Index (Gunning, 1952) (syntactic readability), …

**Figure 3.** Track-level calibration comparison between LLM reviewers and human reviewers on ICLR 2025. Each …

**Figure 5.** Jensen–Shannon divergence between each LLM’s topic distribution and the human-reviewer topic distribution, computed separately for strengths and weaknesses. Weakness comments generally show larger divergence, with substantial variation across models. … negative gap on Clarity: nearly all LLMs assign fewer weakness comments to Clarity than human reviewers. The largest gaps appear for GPT-5-mini (−20.4), Qwen…

**Figure 4.** Model-level weakness topic gap relative to …

**Figure 6.** Writing-style profiles for human reviewers and LLMs. Each radar panel reports five metrics (word count, …

**Figure 7.** Figure 7 illustrates the core mechanism behind our font-based prompt injection. In a TrueType font, the machine-readable content and the human-visible content are connected through the character-to-glyph mapping process. At the underlying level (Unicode Consortium, 2025; Microsoft, 2023; Apple, 2025), ea… (Page footnotes captured in this excerpt: 1 https://platform.openai.com, 2 https://ai.google.dev, 3 https://www.anthropic.com/api, 4 https://www.together.ai)

**Figure 8.** Track-level calibration gaps on ICLR 2023. Each subfigure reports track-level …

**Figure 9.** Paper-level distributions of $r^M_p - r^H_p$ across ICLR 2025 research tracks. Each subfigure corresponds to one LLM. The spread within each track shows that track-level averages mask substantial paper-level variation. …vative, and their paper-level distributions still show substantial dispersion across tracks. This suggests that model advancement may reduce systematic over-rating in some cases, but it does no…

**Figure 10.** Subcriterion-level topic gap between each LLM and human reviewers across 19 subcriteria. Upper panel: …

**Figure 11.** Valence–salience scatter plot for human re…

**Figure 12.** Valence–salience scatter plots for 12 LLM models (4 rows …

**Figure 13.** Prompt injection effects on originally low-scoring papers, aggregated across ICLR 2023, ICLR 2025, …
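The Jensen–Shannon divergence in the Figure 5 caption compares two topic distributions. A minimal sketch follows, assuming each reviewer group is summarized as a count of comments per topic. The log base (2 here, which bounds the result in [0, 1]) and the choice of divergence rather than its square root are assumptions; the caption does not specify either.

```python
import math


def _to_distribution(counts, topics):
    # Normalize raw comment counts into probabilities over a shared topic list.
    total = sum(counts.get(t, 0) for t in topics)
    if total == 0:
        raise ValueError("topic counts sum to zero")
    return [counts.get(t, 0) / total for t in topics]


def _kl(p, m):
    # KL divergence in bits; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence between two topic-count dictionaries."""
    topics = sorted(set(counts_a) | set(counts_b))
    p = _to_distribution(counts_a, topics)
    q = _to_distribution(counts_b, topics)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

Computing it once on strength comments and once on weakness comments, per model against the human distribution, mirrors the layout the caption describes.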

Original abstract

Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLMs overrate weaker papers and get fooled by simple hidden prompt injections on real conference submissions, but the work leans hard on unverified human ratings as ground truth.

The core finding is that LLMs give higher scores to lower-quality papers, diverge from humans on what they flag (under-flagging clarity, over-flagging reproducibility), and remain vulnerable to a font-mapping injection that can push weak papers toward acceptance. The scale, 898 stratified NeurIPS/ICLR papers across 12 models, makes this more than a toy experiment.

The paper does a solid job running the same set of real submissions through multiple LLMs and documenting consistent patterns in length, lexical diversity, and criterion emphasis. The injection test is concrete: hidden instructions embedded in the paper text that promote low-scoring work. That part feels like a useful empirical check on a real deployment risk.

The main weakness is the assumption that the original human ratings form a stable reference. Peer review has well-known low inter-rater agreement, yet the abstract gives no kappa, ICC, or even basic agreement numbers across the multiple reviews per paper. Without that, the reported LLM overrating and divergence could partly reflect baseline noise rather than a distinct LLM bias. The lack of any statistical tests, per-model sample sizes, or error bars in the summary also makes it hard to judge how robust the differences are.

This is worth reading for anyone thinking about putting LLMs into review pipelines or studying adversarial robustness in high-stakes text tasks. It is not a finished story on calibration, but the combination of scale and the specific attack is enough to merit referee time. The authors should be asked to add agreement metrics and basic stats before publication.

I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks 12 LLMs as reviewers on 898 stratified NeurIPS/ICLR papers, evaluating rating calibration against human scores, divergence in topical emphasis (e.g., under-flagging Clarity, over-flagging Reproducibility), review length/diversity/vocabulary, and robustness to prompt injection via an invisible font-mapping attack. It reports that LLMs systematically overrate weaker submissions, produce longer but less lexically diverse reviews, and remain highly vulnerable to simple hidden instructions that can elevate low-scoring papers to acceptance-level ratings, with variation across model families.

Significance. If the empirical results hold after addressing ground-truth concerns, the work would provide actionable evidence on LLM biases and security risks in peer review, supporting calls for safeguards in hybrid systems. The large stratified sample from top venues and multi-axis evaluation (calibration, divergence, attacks) add to its potential utility for conference organizers and AI ethics research.

major comments (2)

[Methodology / Human ratings validation] The central claim that LLMs systematically overrate weaker submissions (and diverge on Clarity/Reproducibility) treats the original human ratings on the 898 papers as reliable ground truth. No inter-rater agreement metrics (e.g., Cohen's kappa, ICC, or variance across multiple human reviews per paper) are reported, despite well-documented low reliability in peer review; this risks confounding LLM-specific tendencies with baseline review noise (see skeptic note on stratification and score variance).
[Prompt injection experiments] The prompt injection results claim high effectiveness for the invisible font-mapping attack in promoting low-scoring papers. However, the manuscript provides insufficient detail on the attack's implementation (exact font mapping, evasion of standard parsing), detection difficulty by humans or LLMs, and whether the threat generalizes beyond the tested models and prompts; this weakens the practical implications for adversarial robustness.
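The agreement check requested in the first major comment could start from something as simple as Cohen's kappa between two reviewers' ratings of the same papers. The sketch below is unweighted and treats ratings as categories; it assumes per-reviewer ratings are available, which this page does not confirm. For ordinal review scores a weighted kappa or an ICC would likely be more appropriate.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items with identical ratings.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal rates per category.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

Values near zero would mean the human reference is close to chance-level agreement, in which case part of any gap between LLM and human ratings could be noise in the reference itself.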

minor comments (2)

[Abstract] The abstract states findings without referencing sample sizes per model, statistical tests, or error bars; the full manuscript should ensure these are clearly reported in results tables or figures for reproducibility.
[Related work / Introduction] Prior work on inter-rater reliability in peer review (e.g., studies on NeurIPS/ICLR agreement) should be cited to contextualize the human ratings used as baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on ground-truth reliability and experimental details. We respond to each major comment below and indicate planned changes to the manuscript.

Referee: [Methodology / Human ratings validation] The central claim that LLMs systematically overrate weaker submissions (and diverge on Clarity/Reproducibility) treats the original human ratings on the 898 papers as reliable ground truth. No inter-rater agreement metrics (e.g., Cohen's kappa, ICC, or variance across multiple human reviews per paper) are reported, despite well-documented low reliability in peer review; this risks confounding LLM-specific tendencies with baseline review noise (see skeptic note on stratification and score variance).

Authors: We acknowledge the well-documented low inter-rater reliability in peer review. The study uses the official NeurIPS/ICLR ratings as the reference benchmark, which is the standard practice for such comparative evaluations. In revision we will add an explicit limitations subsection discussing review noise, citing relevant literature, and clarifying that all divergence findings are relative to these provided human scores. We cannot compute new inter-rater metrics because the released dataset contains only the final aggregated scores and decisions, not multiple independent reviewer annotations per paper.
revision: partial
Referee: [Prompt injection experiments] The prompt injection results claim high effectiveness for the invisible font-mapping attack in promoting low-scoring papers. However, the manuscript provides insufficient detail on the attack's implementation (exact font mapping, evasion of standard parsing), detection difficulty by humans or LLMs, and whether the threat generalizes beyond the tested models and prompts; this weakens the practical implications for adversarial robustness.

Authors: We agree that greater implementation detail is required. The revised manuscript will expand the attack description to specify the exact font-mapping procedure, the mechanism for evading standard parsers, and concrete examples. We will also add analysis of human and LLM detection difficulty plus results on a broader set of models and prompt variants to support the robustness claims.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external ground truth

The paper is a purely empirical benchmark study that evaluates 12 LLMs against human ratings on 898 stratified NeurIPS/ICLR papers. It reports direct comparisons on rating calibration, topical divergence, and prompt-injection success rates. No equations, fitted parameters, predictions derived from inputs, or self-referential definitions appear in the provided text. Human ratings serve as an external reference standard rather than being constructed from the LLM outputs under study. No self-citation chains or uniqueness theorems are invoked to support core claims. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract depends on two domain assumptions about the validity of human ratings as ground truth and the representativeness of the 898-paper sample; no free parameters or invented entities are mentioned.

axioms (2)

Domain assumption: Human reviewer ratings on the selected papers provide an objective baseline for measuring LLM divergence and calibration.
Used to identify overrating and topical divergence.
Domain assumption: The 898 papers stratified from NeurIPS and ICLR are representative enough for general claims about LLM reviewer behavior.
Basis for the benchmark scale and stratification.

pith-pipeline@v0.9.1-grok · 5730 in / 1312 out tokens · 33066 ms · 2026-06-29T22:34:21.664931+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Kilem Li Gwet

Reviewer2: Optimizing review generation through prompt generation. arXiv preprint arXiv:2402.10886. Elizabeth Gibney. 2025. Scientists hide messages in papers to game AI peer review. Nature, 643(8073):887–888. Robert Gunning. 1952. The Technique of Clear Writing. McGraw-Hill, New York. ICML. 2026. Icml 2026 policy for llm use in reviewing. https://icm...

work page arXiv 2025
[2]

Janis Keuper

Peer review in scientific publications: benefits, critiques, & a survival guide. Ejifcc, 25(3):227. Janis Keuper. 2025. Prompt injection attacks on llm generated reviews of scientific publications. arXiv preprint arXiv:2509.10248. Jaeho Kim, Yunseok Lee, and Seulki Lee. 2025. Position: The ai conference peer review crisis demands author feedback and r...

work page arXiv 2025
[3]

Taechoyotin and D

The AI review lottery: Widespread AI-assisted peer reviews boost paper scores and acceptance rates. Proceedings of the ACM on Human-Computer Interaction, 9(CSCW3). Pawin Taechoyotin and Daniel Acuna. 2025. Remor: Automated peer review generation with llm reasoning and multi-objective reinforcement learning. Preprint, arXiv:2505.11718. Pawin Taechoyotin a...

work page arXiv 2025
[4]

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

Cycleresearcher: Improving automated research via automated review. In International Conference on Learning Representations, volume 2025, pages 3669–3709. Kahim Wong, Jicheng Zhou, Kemou Li, Yain-Whar Si, Xiaowei Wu, and Jiantao Zhou. 2025. Fontguard: A robust font watermarking approach leveraging deep font knowledge. IEEE Transactions on Multimedia. Si...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Reviewrl: Towards automated scientific review with rl. arXiv preprint arXiv:2508.10308, 2025

Reviewrl: Towards automated scientific review with rl. Preprint, arXiv:2508.10308. Yangshijie Zhang, Xinda Wang, Jialin Liu, Wenqiang Wang, Zhicong Ma, and Xingxing Jia. 2026. Style attack disguise: When fonts become a camouflage for adversarial intent. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pa...

work page arXiv 2026
[6]

prior work - Others (if not listed above, and please clarify)

Novelty - Originality of the problem - Originality of method - Insightfulness of contributions - Positioning of novelty vs. prior work - Others (if not listed above, and please clarify)
[7]

Technical Quality - Soundness of theoretical claims - Robustness of study design - Appropriateness of evaluations - Validity of conclusions - Others (if not listed above, and please clarify)
[8]

Significance - Importance of the problem - Strength of empirical/theoretical impact - Potential for future research or generalizability - Others (if not listed above, and please clarify)
[9]

Clarity - Quality of writing and structure - Adequacy of related work - Interpretability of figures/tables - Logical flow of arguments - Others (if not listed above, and please clarify)
[10]

Reproducibility - Completeness of experimental/methodological details - Availability of code/data/artifacts - Transparency in methods and limitations - Ease of reproduction - Others (if not listed above, and please clarify)
[11]

prior_knowledge

Others - if the paper has notable aspects not covered above, and please clarify. Your review must be returned as valid JSON in this exact format: { "prior_knowledge": { "seen_before": "Yes|No", "explanation": "Brief explanation if you have seen this paper or know official reviews" }, "summary": "A single paragraph (3-5 sentences) summarizing the paper's m...

2022
