
arXiv: 2605.03202 · v1 · submitted 2026-05-04 · 💻 cs.AI

Stop Automating Peer Review Without Rigorous Evaluation

Pith reviewed 2026-05-08 17:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI peer review · LLM reviewers · peer review automation · hivemind effect · paper laundering · review diversity · ICLR reviews · evaluation of AI systems

The pith

AI systems should not generate peer reviews today because they show excessive agreement and are easily gamed by stylistic rewrites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues against using current AI models to produce peer reviews. It presents empirical comparisons of human and AI reviews for ICLR 2026 papers, finding that AI reviewers display a hivemind effect with high levels of agreement that reduces the diversity of perspectives. The work also demonstrates that prompting an LLM to rewrite a paper can raise AI-assigned scores without changing the scientific content, showing that scores respond to style rather than substance. The authors conclude that solving the peer review crisis requires building a dedicated science of peer review automation instead of deploying general-purpose LLMs without targeted testing.

Core claim

Today's AI systems should not be used to produce paper reviews. An empirical comparison of human- versus AI-generated ICLR 2026 reviews identifies two problems: AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity, and AI review scores are trivially gameable through paper laundering, where prompting an LLM to rewrite a paper significantly increases the scores from AI reviewers.

What carries the argument

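Two measurements carry the argument: the hivemind effect, quantified as review similarity within a paper (IntraSim) and across papers (InterSim), and the paper-laundering test, which has an LLM rewrite each paper and records the change in AI-assigned scores.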
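
A within-paper agreement measure of this kind can be sketched as the mean pairwise similarity between the reviews of one paper. The sketch below is only illustrative: it uses bag-of-words cosine similarity as a stand-in, whereas the representation the paper uses for review similarity is not specified on this page, and the function names and example reviews are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for a text embedding (assumption)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(reviews):
    """Average cosine similarity over all unordered pairs of reviews."""
    vectors = [bag_of_words(r) for r in reviews]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two reviews")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical reviews of one paper, not data from the study.
human_reviews = [
    "The proofs are incomplete.",
    "Strong baselines but weak ablations.",
    "Writing is unclear in section three.",
]
ai_reviews = [
    "The paper is well written but lacks ablations.",
    "The paper is well written but lacks baselines.",
    "The paper is well written but lacks clarity.",
]
human_sim = mean_pairwise_similarity(human_reviews)
ai_sim = mean_pairwise_similarity(ai_reviews)
```

Applying the same function to reviews of different papers written by one reviewer would give a cross-paper analogue. Higher values mean less diversity of perspective under this stand-in metric.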

If this is right

  • Non-gameability and review diversity are necessary conditions for any automated review system to be viable.
  • General-purpose LLMs require specific evaluation for diversity and robustness before deployment in peer review.
  • Addressing the peer review crisis requires development of a science of peer review automation rather than off-the-shelf models.
  • Stylistic manipulation alone should not be able to change review outcomes if automation is to be reliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar hivemind and gaming problems could appear when LLMs are used for other evaluative tasks such as grant or hiring assessments.
  • Specialized models trained on diverse human review data might reduce excessive agreement compared with general LLMs.
  • Conferences could develop methods to detect LLM-rewritten papers if AI review is ever adopted.

Load-bearing premise

The ICLR 2026 sample and the specific AI models tested are representative of broader peer review contexts, and LLM rewriting preserves scientific content without introducing legitimate improvements.

What would settle it

A broader test across multiple conferences and AI models that finds high disagreement among AI reviewers on identical papers and no score increase after LLM rewriting.

Figures

Figures reproduced from arXiv: 2605.03202 by Dirk Hovy, Jiaxin Pei, Joachim Baumann, Sanmi Koyejo.

Figure 2. Simulated AI reviewers show excessive within-paper agreement. Intra-paper inter-reviewer similarity (IntraSim) compares human ICLR reviews with AI-generated reviews for original and laundered papers (n = 60 papers). ICLR human reviews: mean = 0.811. AI reviews of original papers: mean = 0.882 (+8.7%, p < 0.0001, Cohen’s d = 1.47). AI reviews of laundered papers: mean = 0.891 (+9.8% vs. ICLR, p < 0.0001, …
Figure 1. The AI reviewer hivemind effect in ICLR 2026 reviews. Distribution of pairwise inter-paper review similarity (InterSim) for fully AI-generated reviews versus all other reviews (human-written and AI-assisted). Fully AI-generated reviews show significantly higher within-group similarity (mean = 0.486) compared to other reviews (mean = 0.467; t = 3218, p < 0.0001, Cohen’s d = 0.29). Data: 75,800 ICLR 2026 …
Figure 3. AI reviewers produce similar reviews across different papers. Inter-paper intra-reviewer similarity (InterSim) compares cross-paper review similarity for human ICLR reviewers versus AI reviewer agents. ICLR human reviews: mean = 0.470. GPT-5.1 reviews show +37.4% (original) to +39.8% (laundered) higher similarity. Claude reviews show +17.6% (original) to +20.0% (laundered) higher similarity. All difference…
Figure 4. Paper laundering games AI reviewers across prompts, launderer models, and reviewer models. Mean paired score increase (laundered − original) with 95% CIs across 24 conditions: 4 zero-shot prompts × 2 launderer models × 3 reviewer models. n = 60 papers per condition; overall mean +0.45, Wilcoxon signed-rank tests p < 0.001 in nearly every condition. The dashed line indicates no change.
Figure 5. Outcome distribution per (reviewer, launderer) pair, aggregated over the 4 prompts. For every reviewer, we have more score increases than score decreases. GPT-5.4 produces a larger fraction of score increases than GPT-5.1 as the launderer. GPT reviewers tend to show larger score increases than Claude, consistent with self-preference bias (Panickssery et al., 2024). For a per-condition breakdown, see Append…
Figure 6. Paper laundering drives intellectual monoculture. Distribution of pairwise cosine similarity between paper embeddings (abstract + introduction) for original versus laundered papers (n = 6,903 paper pairs from 60 papers). Original papers: mean similarity = 0.497. Laundered papers: mean similarity = 0.529. The 6.5% increase in similarity is significant (t = 84.8, p < 0.0001, Cohen’s d = 1.02), indicating t…
Figure 7. Hivemind effect in simulated AI reviews, restricted to weaknesses and questions. Effect sizes increase compared to the full-review version (…
Figure 8. Hivemind effect in all ICLR 2026 in-the-wild reviews, restricted to weaknesses and questions. AI-generated mean InterSim = 0.495 vs. other = 0.471 (Cohen’s d = 0.35, p < 0.0001). The effect size increases compared to the full-review version (…
Figure 9. In-the-wild hivemind effect stratified by ICLR 2026 primary area, full reviews. InterSim is computed separately for fully AI-generated and other reviews within each of the 21 primary areas.
Figure 10. Same stratification, restricted to weaknesses and questions. The effect remains significant (p < 0.0001) in every area, with generally larger effect sizes than in …
Figure 11. Pangram predictions for the 58 ICLR 2026 reviews that authors accused of being AI-generated. 86.2% are flagged by Pangram as fully AI-generated; only 3.4% are classified as fully human-written.
    H AI-generated reviews: We automatically generated reviews by feeding our own manuscript to AI reviewers (at the time of submission), using the same setup as in our experiments. We paste the unedited result here (so…
Figure 12. Outcome distribution per (reviewer, launderer, prompt) condition. Hatching indicates the launderer model.
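
The paired score increase reported for the laundering experiment (laundered minus original, with 95% CIs) can be illustrated with a small sketch. The paper reports Wilcoxon signed-rank tests; the sketch below substitutes a percentile bootstrap for the interval because it needs only the standard library, and how the paper computed its CIs is not stated on this page. The scores and function names are hypothetical.

```python
import random
import statistics


def paired_increase(original, laundered):
    """Per-paper score difference (laundered minus original)."""
    if len(original) != len(laundered):
        raise ValueError("scores must be paired per paper")
    return [after - before for before, after in zip(original, laundered)]


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical AI-reviewer scores for six papers, not data from the study.
original = [4.0, 5.0, 6.0, 3.0, 5.0, 4.0]
laundered = [5.0, 5.0, 6.0, 4.0, 6.0, 4.0]

diffs = paired_increase(original, laundered)
mean_diff = statistics.fmean(diffs)
low, high = bootstrap_ci(diffs)
```

An interval that excludes zero would suggest a systematic score shift from rewriting alone. Whether that shift reflects gaming or a legitimate clarity gain is a separate question that this computation cannot answer.
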
Original abstract

Large language models offer a tempting solution to address the peer review crisis. This position paper argues that today's AI systems should not be used to produce paper reviews. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through paper laundering: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are necessary but not sufficient conditions for automation. We argue that addressing the peer review crisis requires a science of peer review automation -- not general-purpose LLMs deployed without rigorous evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This position paper argues that today's AI systems should not be used to produce paper reviews without rigorous evaluation. It grounds the position in an empirical comparison of human- versus AI-generated reviews for ICLR 2026 papers, identifying a 'hivemind effect' of excessive agreement among AI reviewers that reduces perspective diversity, and in an evaluation showing that LLM-based rewriting of papers can significantly increase scores from AI reviewers, indicating that such systems are gameable through stylistic changes rather than scientific improvements. The authors conclude that addressing the peer review crisis requires developing a dedicated science of peer review automation.

Significance. If the empirical results hold, the paper makes a timely contribution by providing concrete evidence against hasty deployment of general-purpose LLMs in peer review. The targeted human-AI comparisons and controlled rewriting intervention directly illustrate risks of reduced diversity and vulnerability to manipulation, supporting the call for systematic evaluation frameworks. This could influence policy and research directions in AI-assisted academic processes.

major comments (3)
  1. [empirical comparison of human- versus AI-generated reviews] The hivemind effect claim in the empirical comparison section relies on measures of agreement within and across papers; without explicit details on sample size, statistical tests (e.g., inter-rater reliability metrics), and controls for confounding factors such as paper topic or length, the evidence strength for reduced perspective diversity remains moderate and load-bearing for the first critical issue.
  2. [evaluation of the effect of automated paper rewriting] In the evaluation of the effect of automated paper rewriting, the claim that AI review scores are 'trivially gameable' through stylistic changes is not fully supported without independent human ratings of the rewritten papers' actual scientific merit. If rewriting improves clarity or flow without altering core claims, higher scores may reflect legitimate quality gains rather than a flaw in AI reviewers.
  3. [discussion and conclusions] The paper's conclusions on non-gameability and diversity as necessary conditions would benefit from a more explicit discussion of what additional rigorous tests (beyond the current interventions) would be required to deem automation acceptable, to make the position against general-purpose LLMs more precise.
minor comments (2)
  1. [abstract] The abstract would be clearer if it included brief quantitative indicators (e.g., effect sizes or agreement statistics) alongside the qualitative claims about excessive agreement and score increases.
  2. [methods] The term 'paper laundering' is used effectively but should be defined operationally in the methods section to distinguish it from general editing.
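
The effect sizes mentioned in minor comment 1 appear in the figure captions as Cohen's d. As a reminder of what that statistic measures, here is a minimal sketch using the pooled-standard-deviation form. Which variant the paper uses is not stated on this page, so the pooled form is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled == 0:
        raise ValueError("effect size is undefined when both groups have zero variance")
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled
```

Because d scales the mean difference by the spread, a small absolute gap in similarity can still be a large effect when the distributions are narrow.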

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for your thoughtful and constructive review. We appreciate the recognition of our paper's timely contribution and the specific suggestions for strengthening the empirical details, addressing potential limitations in the rewriting evaluation, and clarifying the conditions for acceptable automation. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: The hivemind effect claim in the empirical comparison section relies on measures of agreement within and across papers; without explicit details on sample size, statistical tests (e.g., inter-rater reliability metrics), and controls for confounding factors such as paper topic or length, the evidence strength for reduced perspective diversity remains moderate and load-bearing for the first critical issue.

    Authors: We will revise the empirical comparison section to explicitly report the sample size (number of ICLR 2026 papers and total reviews analyzed), the inter-rater reliability metrics and statistical tests used (including Fleiss' kappa or equivalent with p-values), and controls for confounders such as paper topic (via domain categorization) and length (via matching or covariate adjustment). These additions will provide a stronger basis for the hivemind effect. revision: yes

  2. Referee: In the evaluation of the effect of automated paper rewriting, the claim that AI review scores are 'trivially gameable' through stylistic changes is not fully supported without independent human ratings of the rewritten papers' actual scientific merit. If rewriting improves clarity or flow without altering core claims, higher scores may reflect legitimate quality gains rather than a flaw in AI reviewers.

    Authors: We acknowledge this limitation. Our paper-laundering prompts were restricted to stylistic and clarity improvements without altering scientific claims or results, as verified by the authors. We will add an explicit discussion of this point in the revised manuscript, noting that independent human ratings of merit would strengthen the evidence against legitimate gains. Nevertheless, the results still demonstrate AI reviewers' sensitivity to superficial changes, supporting concerns about gameability. revision: partial

  3. Referee: The paper's conclusions on non-gameability and diversity as necessary conditions would benefit from a more explicit discussion of what additional rigorous tests (beyond the current interventions) would be required to deem automation acceptable, to make the position against general-purpose LLMs more precise.

    Authors: We agree and will expand the discussion and conclusions to outline specific additional tests required before deeming automation acceptable. These include large-scale multi-dimensional comparisons with human reviews, robustness evaluations against diverse adversarial manipulations, and longitudinal studies on review quality and process outcomes. This will make our position more precise while reinforcing the call for a dedicated science of peer review automation. revision: yes
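
The first response names Fleiss' kappa as a candidate inter-rater reliability metric. A minimal sketch of that statistic follows. It assumes review scores have been binned into categories and that every paper has the same number of reviewers; neither assumption comes from the paper, and the function name is hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters placing item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance from the category shares.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_obs - p_exp) / (1 - p_exp)
```

Kappa measures agreement on scores, not textual similarity, so it would complement the similarity-based measures in the figures and not replace them.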

Circularity Check

0 steps flagged

No significant circularity; position grounded in new empirical observations

Full rationale

The paper presents a position against automating peer review based on direct empirical comparisons of human- versus AI-generated ICLR 2026 reviews and controlled tests of LLM paper rewriting effects on AI reviewer scores. These are independent data collection steps with no equations, fitted parameters, or derivations that reduce to self-defined inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the conclusions; the hivemind and gameability claims are framed as results from the described experiments rather than presupposed by construction. The paper is self-contained against its own benchmarks and does not rename known results or smuggle assumptions via prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The position rests on empirical measurements rather than additional free parameters or invented entities; it invokes standard assumptions about what constitutes good peer review.

axioms (2)
  • domain assumption: Diversity of reviewer perspectives improves the quality of peer review.
    Invoked when describing the hivemind effect as reducing perspective diversity.
  • domain assumption: Peer review scores should primarily reflect scientific merit rather than writing style.
    Underlying the interpretation of the laundering experiment as evidence of gameability.

pith-pipeline@v0.9.0 · 5458 in / 1389 out tokens · 76719 ms · 2026-05-08T17:56:20.421206+00:00 · methodology


