Stop Automating Peer Review Without Rigorous Evaluation
Pith reviewed 2026-05-08 17:56 UTC · model grok-4.3
The pith
AI systems should not generate peer reviews today because they show excessive agreement and are easily gamed by stylistic rewrites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Today's AI systems should not be used to produce paper reviews. An empirical comparison of human- versus AI-generated ICLR 2026 reviews identifies two problems: AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity, and AI review scores are trivially gameable through paper laundering, where prompting an LLM to rewrite a paper significantly increases the scores from AI reviewers.
What carries the argument
The hivemind effect measured in AI review agreement patterns and the paper-laundering test that applies LLM rewriting to measure score changes.
If this is right
- Non-gameability and review diversity are necessary conditions for any automated review system to be viable.
- General-purpose LLMs require specific evaluation for diversity and robustness before deployment in peer review.
- Addressing the peer review crisis requires development of a science of peer review automation rather than off-the-shelf models.
- Stylistic manipulation alone should not be able to change review outcomes if automation is to be reliable.
Where Pith is reading between the lines
- Similar hivemind and gaming problems could appear when LLMs are used for other evaluative tasks such as grant or hiring assessments.
- Specialized models trained on diverse human review data might reduce excessive agreement compared with general LLMs.
- Conferences could develop methods to detect LLM-rewritten papers if AI review is ever adopted.
Load-bearing premise
The ICLR 2026 sample and the specific AI models tested are representative of broader peer review contexts, and LLM rewriting preserves scientific content without introducing legitimate improvements.
What would settle it
A broader test across multiple conferences and AI models that finds high disagreement among AI reviewers on identical papers and no score increase after LLM rewriting.
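The score-increase half of such a test can be sketched as a paired comparison of AI review scores before and after rewriting. The snippet below is a minimal illustration, not the paper's procedure: the function name `paired_sign_flip_test`, the choice of a sign-flip permutation test, and the score lists are all assumptions made up for this example.

```python
import random
from statistics import mean


def paired_sign_flip_test(before, after, n_perm=10000, seed=0):
    """Return (mean score gain, one-sided p-value) for paired review scores."""
    if len(before) != len(after) or not before:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [a - b for a, b in zip(after, before)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis rewriting has no effect, so the sign of
        # each paper's score difference is equally likely to be + or -.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical scores for eight papers, invented for illustration only.
original = [4, 5, 3, 6, 4, 5, 3, 4]
rewritten = [5, 6, 4, 6, 5, 6, 4, 5]
gain, p_value = paired_sign_flip_test(original, rewritten)
print(f"mean gain = {gain:.3f}, one-sided p = {p_value:.4f}")
```

A small p-value here would only show that scores rose after rewriting. Whether the rise reflects gaming or a legitimate quality gain still needs independent human ratings, as the referee report below points out.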
Original abstract
Large language models offer a tempting solution to address the peer review crisis. This position paper argues that today's AI systems should not be used to produce paper reviews. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through paper laundering: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are necessary but not sufficient conditions for automation. We argue that addressing the peer review crisis requires a science of peer review automation -- not general-purpose LLMs deployed without rigorous evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that today's AI systems should not be used to produce paper reviews without rigorous evaluation. It grounds the position in an empirical comparison of human- versus AI-generated reviews for ICLR 2026 papers, identifying a 'hivemind effect' of excessive agreement among AI reviewers that reduces perspective diversity, and in an evaluation showing that LLM-based rewriting of papers can significantly increase scores from AI reviewers, indicating that such systems are gameable through stylistic changes rather than scientific improvements. The authors conclude that addressing the peer review crisis requires developing a dedicated science of peer review automation.
Significance. If the empirical results hold, the paper makes a timely contribution by providing concrete evidence against hasty deployment of general-purpose LLMs in peer review. The targeted human-AI comparisons and controlled rewriting intervention directly illustrate risks of reduced diversity and vulnerability to manipulation, supporting the call for systematic evaluation frameworks. This could influence policy and research directions in AI-assisted academic processes.
major comments (3)
- [empirical comparison of human- versus AI-generated reviews] The hivemind effect claim in the empirical comparison section relies on measures of agreement within and across papers. Without explicit details on sample size, statistical tests (e.g., inter-rater reliability metrics), and controls for confounding factors such as paper topic or length, the evidence for reduced perspective diversity remains only moderate, even though the first critical issue rests on it.
- [evaluation of the effect of automated paper rewriting] In the evaluation of the effect of automated paper rewriting, the claim that AI review scores are 'trivially gameable' through stylistic changes is not fully supported without independent human ratings of the rewritten papers' actual scientific merit. If rewriting improves clarity or flow without altering core claims, higher scores may reflect legitimate quality gains rather than a flaw in AI reviewers.
- [discussion and conclusions] The paper's conclusions on non-gameability and diversity as necessary conditions would benefit from a more explicit discussion of what additional rigorous tests (beyond the current interventions) would be required to deem automation acceptable, to make the position against general-purpose LLMs more precise.
minor comments (2)
- [abstract] The abstract would be clearer if it included brief quantitative indicators (e.g., effect sizes or agreement statistics) alongside the qualitative claims about excessive agreement and score increases.
- [methods] The term 'paper laundering' is used effectively but should be defined operationally in the methods section to distinguish it from general editing.
Simulated Author's Rebuttal
Thank you for your thoughtful and constructive review. We appreciate the recognition of our paper's timely contribution and the specific suggestions for strengthening the empirical details, addressing potential limitations in the rewriting evaluation, and clarifying the conditions for acceptable automation. We address each major comment below and will revise the manuscript accordingly.
- Referee: The hivemind effect claim in the empirical comparison section relies on measures of agreement within and across papers. Without explicit details on sample size, statistical tests (e.g., inter-rater reliability metrics), and controls for confounding factors such as paper topic or length, the evidence for reduced perspective diversity remains only moderate, even though the first critical issue rests on it.
  Authors: We will revise the empirical comparison section to explicitly report the sample size (number of ICLR 2026 papers and total reviews analyzed), the inter-rater reliability metrics and statistical tests used (including Fleiss' kappa or equivalent with p-values), and controls for confounders such as paper topic (via domain categorization) and length (via matching or covariate adjustment). These additions will provide a stronger basis for the hivemind effect.
  Revision: yes
- Referee: In the evaluation of the effect of automated paper rewriting, the claim that AI review scores are 'trivially gameable' through stylistic changes is not fully supported without independent human ratings of the rewritten papers' actual scientific merit. If rewriting improves clarity or flow without altering core claims, higher scores may reflect legitimate quality gains rather than a flaw in AI reviewers.
  Authors: We acknowledge this limitation. Our paper-laundering prompts were restricted to stylistic and clarity improvements without altering scientific claims or results, as verified by the authors. We will add an explicit discussion of this point in the revised manuscript, noting that independent human ratings of merit would help rule out legitimate quality gains as the explanation. Nevertheless, the results still demonstrate AI reviewers' sensitivity to superficial changes, supporting concerns about gameability.
  Revision: partial
- Referee: The paper's conclusions on non-gameability and diversity as necessary conditions would benefit from a more explicit discussion of what additional rigorous tests (beyond the current interventions) would be required to deem automation acceptable, to make the position against general-purpose LLMs more precise.
  Authors: We agree and will expand the discussion and conclusions to outline specific additional tests required before deeming automation acceptable. These include large-scale multi-dimensional comparisons with human reviews, robustness evaluations against diverse adversarial manipulations, and longitudinal studies on review quality and process outcomes. This will make our position more precise while reinforcing the call for a dedicated science of peer review automation.
  Revision: yes
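For readers unfamiliar with the inter-rater statistic named in the first response, the following is a minimal sketch of Fleiss' kappa using its standard definition. It assumes review scores are treated as unordered categories and that every paper receives the same number of reviews; the category labels and count tables are invented for illustration and are not the paper's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[item][category] of rating counts.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that gave the same category.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical tables: 3 papers, 4 reviewers, categories (reject, borderline, accept).
unanimous = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]
mixed = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
print(fleiss_kappa(unanimous), fleiss_kappa(mixed))
```

Under this reading, a hivemind effect would show up as AI reviewer panels sitting much closer to the unanimous table than human panels do. Because review scores are ordinal, a weighted or ordinal agreement measure may be a better fit in practice; that choice is left open here.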
Circularity Check
No significant circularity; position grounded in new empirical observations
The paper presents a position against automating peer review based on direct empirical comparisons of human- versus AI-generated ICLR 2026 reviews and controlled tests of LLM paper rewriting effects on AI reviewer scores. These are independent data collection steps with no equations, fitted parameters, or derivations that reduce to self-defined inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the conclusions; the hivemind and gameability claims are framed as results from the described experiments rather than presupposed by construction. The paper is self-contained against its own benchmarks and does not rename known results or smuggle assumptions via prior author work.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Diversity of reviewer perspectives improves the quality of peer review.
- Domain assumption: Peer review scores should primarily reflect scientific merit rather than writing style.
Lean theorems connected to this paper
- IndisputableMonolith/Cost (Jcost, RCLCombiner): washburn_uniqueness_aczel / RCLCombiner_isCoupling_iff
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Review embeddings are generated using OpenAI's text-embedding-3-small model... we compute cosine similarity sim between vector representations of reviews."
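The similarity computation described in the quoted passage can be sketched as follows. This is an illustration under stated assumptions: the embedding step itself is not reproduced (the vectors below are toy values, not model outputs), and averaging over all review pairs is an assumed aggregation, since the quoted passage does not say how similarities are combined. The names `cosine_similarity` and `mean_pairwise_similarity` are hypothetical helpers.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d vectors standing in for review embeddings (illustration only).
reviews = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
print(mean_pairwise_similarity(reviews))
```

Higher mean similarity among the reviews of one paper indicates more homogeneous reviews, which is the direction the hivemind claim predicts for AI reviewers relative to humans.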
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.