Stop Automating Peer Review Without Rigorous Evaluation

Dirk Hovy; Jiaxin Pei; Joachim Baumann; Sanmi Koyejo

arxiv: 2605.03202 · v1 · submitted 2026-05-04 · 💻 cs.AI

Pith reviewed 2026-05-08 17:56 UTC · model grok-4.3

keywords: AI peer review, LLM reviewers, peer review automation, hivemind effect, paper laundering, review diversity, ICLR reviews, evaluation of AI systems

The pith

AI systems should not generate peer reviews today because they show excessive agreement and are easily gamed by stylistic rewrites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues against using current AI models to produce peer reviews. It presents empirical comparisons of human and AI reviews for ICLR 2026 papers, finding that AI reviewers display a hivemind effect with high levels of agreement that reduces the diversity of perspectives. The work also demonstrates that prompting an LLM to rewrite a paper can raise AI-assigned scores without changing the scientific content, showing that scores respond to style rather than substance. The authors conclude that solving the peer review crisis requires building a dedicated science of peer review automation instead of deploying general-purpose LLMs without targeted testing.

Core claim

Today's AI systems should not be used to produce paper reviews. An empirical comparison of human- versus AI-generated ICLR 2026 reviews identifies two problems: AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity, and AI review scores are trivially gameable through paper laundering, where prompting an LLM to rewrite a paper significantly increases the scores from AI reviewers.

What carries the argument

The hivemind effect measured in AI review agreement patterns and the paper-laundering test that applies LLM rewriting to measure score changes.
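
The laundering test boils down to a paired comparison: each paper receives an AI score before and after rewriting, and the question is whether the differences lean positive. The sketch below is a minimal, standard-library Python illustration of a Wilcoxon signed-rank test on such pairs (the test family named in the Figure 4 caption). The function names and any scores passed to them are hypothetical and are not taken from the paper. The exact p-value enumerates every sign pattern, so this version is only practical for small samples.

```python
from itertools import product
from statistics import mean


def signed_ranks(diffs):
    """Rank nonzero differences by absolute value, averaging ranks across ties.

    Returns a list of (rank, sign) pairs. Zero differences are dropped,
    which is one common convention for the signed-rank test.
    """
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    return [(r, 1 if d > 0 else -1) for r, d in zip(ranks, ordered)]


def wilcoxon_exact(original_scores, laundered_scores):
    """Return (mean paired increase, W+, exact two-sided p-value).

    Hypothetical helper: the inputs are per-paper scores before and after
    rewriting, in the same paper order.
    """
    diffs = [after - before for before, after in zip(original_scores, laundered_scores)]
    ranked = signed_ranks(diffs)
    ranks = [r for r, _ in ranked]
    total = sum(ranks)
    w_plus = sum(r for r, sign in ranked if sign > 0)
    observed = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign pattern is equally likely, so the
    # p-value is the share of patterns at least as extreme as the observed one.
    extreme = 0
    patterns = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
        patterns += 1
    return mean(diffs), w_plus, extreme / patterns
```

With 60 papers per condition, as reported in the paper, enumerating sign patterns is infeasible; a normal approximation or a library routine such as `scipy.stats.wilcoxon` would be the practical choice there.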

If this is right

Non-gameability and review diversity are necessary conditions for any automated review system to be viable.
General-purpose LLMs require specific evaluation for diversity and robustness before deployment in peer review.
Addressing the peer review crisis requires development of a science of peer review automation rather than off-the-shelf models.
Stylistic manipulation alone should not be able to change review outcomes if automation is to be reliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar hivemind and gaming problems could appear when LLMs are used for other evaluative tasks such as grant or hiring assessments.
Specialized models trained on diverse human review data might reduce excessive agreement compared with general LLMs.
Conferences could develop methods to detect LLM-rewritten papers if AI review is ever adopted.

Load-bearing premise

The ICLR 2026 sample and the specific AI models tested are representative of broader peer review contexts, and LLM rewriting preserves scientific content without introducing legitimate improvements.

What would settle it

A broader test across multiple conferences and AI models that finds high disagreement among AI reviewers on identical papers and no score increase after LLM rewriting.

Figures

Figures reproduced from arXiv: 2605.03202 by Dirk Hovy, Jiaxin Pei, Joachim Baumann, Sanmi Koyejo.

**Figure 1.** The AI reviewer hivemind effect in ICLR 2026 reviews. Distribution of pairwise inter-paper review similarity (InterSim) for fully AI-generated reviews versus all other reviews (human-written and AI-assisted). Fully AI-generated reviews show significantly higher within-group similarity (mean = 0.486) compared to other reviews (mean = 0.467; t = 3218, p < 0.0001, Cohen’s d = 0.29). Data: 75,800 ICLR 2026 …

**Figure 2.** Simulated AI reviewers show excessive within-paper agreement. Intra-paper inter-reviewer similarity (IntraSim) compares human ICLR reviews with AI-generated reviews for original and laundered papers (n = 60 papers). ICLR human reviews: mean = 0.811. AI reviews of original papers: mean = 0.882 (+8.7%, p < 0.0001, Cohen’s d = 1.47). AI reviews of laundered papers: mean = 0.891 (+9.8% vs. ICLR, p < 0.0001, …

**Figure 3.** AI reviewers produce similar reviews across different papers. Inter-paper intra-reviewer similarity (InterSim) compares cross-paper review similarity for human ICLR reviewers versus AI reviewer agents. ICLR human reviews: mean = 0.470. GPT-5.1 reviews show +37.4% (original) to +39.8% (laundered) higher similarity. Claude reviews show +17.6% (original) to +20.0% (laundered) higher similarity. All difference…

**Figure 4.** Paper laundering games AI reviewers across prompts, launderer models, and reviewer models. Mean paired score increase (laundered − original) with 95% CIs across 24 conditions: 4 zero-shot prompts × 2 launderer models × 3 reviewer models. n = 60 papers per condition; overall mean +0.45, Wilcoxon signed-rank tests p < 0.001 in nearly every condition. The dashed line indicates no change.

**Figure 5.** Outcome distribution per (reviewer, launderer) pair, aggregated over the 4 prompts. For every reviewer, we have more score increases than score decreases. GPT-5.4 produces a larger fraction of score increases than GPT-5.1 as the launderer. GPT reviewers tend to show larger score increases than Claude, consistent with self-preference bias (Panickssery et al., 2024). For a per-condition breakdown, see Append…

**Figure 6.** Paper laundering drives intellectual monoculture. Distribution of pairwise cosine similarity between paper embeddings (abstract + introduction) for original versus laundered papers (n = 6,903 paper pairs from 60 papers). Original papers: mean similarity = 0.497. Laundered papers: mean similarity = 0.529. The 6.5% increase in similarity is significant (t = 84.8, p < 0.0001, Cohen’s d = 1.02), indicating t…

**Figure 7.** Hivemind effect in simulated AI reviews, restricted to weaknesses and questions. Effect sizes increase compared to the full-review version (…

**Figure 8.** Hivemind effect in all ICLR 2026 in-the-wild reviews, restricted to weaknesses and questions. AI-generated mean InterSim = 0.495 vs. other = 0.471 (Cohen’s d = 0.35, p < 0.0001). The effect size increases compared to the full-review version (…

**Figure 9.** In-the-wild hivemind effect stratified by ICLR 2026 primary area, full reviews. InterSim is computed separately for fully AI-generated and other reviews within each of the 21 primary areas.

**Figure 10.** Same stratification, restricted to weaknesses and questions. The effect remains significant (p < 0.0001) in every area, with generally larger effect sizes than in …

**Figure 11.** Pangram predictions for the 58 ICLR 2026 reviews that authors accused of being AI-generated. 86.2% are flagged by Pangram as fully AI-generated; only 3.4% are classified as fully human-written.

H AI-generated reviews. We automatically generated reviews by feeding our own manuscript to AI reviewers (at the time of submission), using the same setup as in our experiments. We paste the unedited result here (so…

**Figure 12.** Outcome distribution per (reviewer, launderer, prompt) condition. Hatching indicates the launderer model.
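
The captions above report two similarity quantities and a standardized effect size. As a rough illustration, the sketch below computes a mean pairwise cosine similarity within one paper's reviews (an IntraSim-style quantity), a mean cosine similarity across reviews of different papers (an InterSim-style quantity), and Cohen's d between two groups of values. It assumes both quantities are plain means of pairwise cosine similarities over review embeddings, which this page does not confirm, and it ignores reviewer identity. All function names are hypothetical, and short numeric lists stand in for real embedding vectors.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_paper_similarity(review_vectors):
    """Mean cosine similarity over all review pairs of ONE paper.

    Assumed reading of IntraSim; needs at least two reviews.
    """
    return mean(cosine(u, v) for u, v in combinations(review_vectors, 2))


def inter_paper_similarity(reviews_by_paper):
    """Mean cosine similarity over review pairs drawn from DIFFERENT papers.

    Assumed reading of InterSim; `reviews_by_paper` maps a paper id to a
    list of review vectors.
    """
    sims = []
    for (_, reviews_a), (_, reviews_b) in combinations(reviews_by_paper.items(), 2):
        sims.extend(cosine(u, v) for u in reviews_a for v in reviews_b)
    return mean(sims)


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (one common convention)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
```

In a comparison like the one in Figure 2, one would compute `intra_paper_similarity` per paper for human and for AI reviews, then pass the two lists of per-paper values to `cohens_d`. Whether the authors pooled variances this way is not stated on this page.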

Original abstract

Large language models offer a tempting solution to address the peer review crisis. This position paper argues that today's AI systems should not be used to produce paper reviews. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through paper laundering: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are necessary but not sufficient conditions for automation. We argue that addressing the peer review crisis requires a science of peer review automation -- not general-purpose LLMs deployed without rigorous evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives targeted evidence that AI reviewers agree too much and raise scores after LLM rewrites, but the rewrite test leaves open whether those gains reflect real quality or just style.

This paper shows that current AI reviewers tend to agree too much with each other and that their scores can be boosted by having an LLM rewrite the paper, which raises real questions about using them for peer review right now.

The new part is the set of concrete measurements from ICLR 2026 reviews. The authors document the excessive agreement, calling it a hivemind effect, and run a controlled test in which rewriting papers leads to higher AI scores. This goes beyond abstract warnings by using real submissions and showing the effect size on actual review scores. The paper does well by comparing directly to human reviews and by noting that fixing these two issues still wouldn't be enough for safe automation. The call for a science of peer review automation is a reasonable next step.

The main soft spot is the laundering experiment. The score increases could come from genuine improvements in how the paper is written, like better flow or emphasis, rather than pure stylistic gaming. Without separate human judgments on whether the rewritten papers are actually better or just different, it's hard to say the effect is only about exploitation. The representativeness of the ICLR sample and of the specific models used is another limit, though that's common in this kind of work.

Readers working on conference policies or AI for academic tasks will find this relevant. The paper makes a measured case with data, so it deserves peer review to check the details and see whether the conclusions hold up under scrutiny. I would recommend sending it to referees.

Referee Report

3 major / 2 minor

Summary. This position paper argues that today's AI systems should not be used to produce paper reviews without rigorous evaluation. It grounds the position in an empirical comparison of human- versus AI-generated reviews for ICLR 2026 papers, identifying a 'hivemind effect' of excessive agreement among AI reviewers that reduces perspective diversity, and in an evaluation showing that LLM-based rewriting of papers can significantly increase scores from AI reviewers, indicating that such systems are gameable through stylistic changes rather than scientific improvements. The authors conclude that addressing the peer review crisis requires developing a dedicated science of peer review automation.

Significance. If the empirical results hold, the paper makes a timely contribution by providing concrete evidence against hasty deployment of general-purpose LLMs in peer review. The targeted human-AI comparisons and controlled rewriting intervention directly illustrate risks of reduced diversity and vulnerability to manipulation, supporting the call for systematic evaluation frameworks. This could influence policy and research directions in AI-assisted academic processes.

major comments (3)

[empirical comparison of human- versus AI-generated reviews] The hivemind effect claim in the empirical comparison section relies on measures of agreement within and across papers; without explicit details on sample size, statistical tests (e.g., inter-rater reliability metrics), and controls for confounding factors such as paper topic or length, the evidence strength for reduced perspective diversity remains moderate and load-bearing for the first critical issue.
[evaluation of the effect of automated paper rewriting] In the evaluation of the effect of automated paper rewriting, the claim that AI review scores are 'trivially gameable' through stylistic changes is not fully supported without independent human ratings of the rewritten papers' actual scientific merit. If rewriting improves clarity or flow without altering core claims, higher scores may reflect legitimate quality gains rather than a flaw in AI reviewers.
[discussion and conclusions] The paper's conclusions on non-gameability and diversity as necessary conditions would benefit from a more explicit discussion of what additional rigorous tests (beyond the current interventions) would be required to deem automation acceptable, to make the position against general-purpose LLMs more precise.

minor comments (2)

[abstract] The abstract would be clearer if it included brief quantitative indicators (e.g., effect sizes or agreement statistics) alongside the qualitative claims about excessive agreement and score increases.
[methods] The term 'paper laundering' is used effectively but should be defined operationally in the methods section to distinguish it from general editing.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for your thoughtful and constructive review. We appreciate the recognition of our paper's timely contribution and the specific suggestions for strengthening the empirical details, addressing potential limitations in the rewriting evaluation, and clarifying the conditions for acceptable automation. We address each major comment below and will revise the manuscript accordingly.

Referee: The hivemind effect claim in the empirical comparison section relies on measures of agreement within and across papers; without explicit details on sample size, statistical tests (e.g., inter-rater reliability metrics), and controls for confounding factors such as paper topic or length, the evidence strength for reduced perspective diversity remains moderate and load-bearing for the first critical issue.

Authors: We will revise the empirical comparison section to explicitly report the sample size (number of ICLR 2026 papers and total reviews analyzed), the inter-rater reliability metrics and statistical tests used (including Fleiss' kappa or equivalent with p-values), and controls for confounders such as paper topic (via domain categorization) and length (via matching or covariate adjustment). These additions will provide a stronger basis for the hivemind effect.

Revision: yes

Referee: In the evaluation of the effect of automated paper rewriting, the claim that AI review scores are 'trivially gameable' through stylistic changes is not fully supported without independent human ratings of the rewritten papers' actual scientific merit. If rewriting improves clarity or flow without altering core claims, higher scores may reflect legitimate quality gains rather than a flaw in AI reviewers.

Authors: We acknowledge this limitation. Our paper-laundering prompts were restricted to stylistic and clarity improvements without altering scientific claims or results, as verified by the authors. We will add an explicit discussion of this point in the revised manuscript, noting that independent human ratings of merit would strengthen the evidence against legitimate gains. Nevertheless, the results still demonstrate AI reviewers' sensitivity to superficial changes, supporting concerns about gameability.

Revision: partial

Referee: The paper's conclusions on non-gameability and diversity as necessary conditions would benefit from a more explicit discussion of what additional rigorous tests (beyond the current interventions) would be required to deem automation acceptable, to make the position against general-purpose LLMs more precise.

Authors: We agree and will expand the discussion and conclusions to outline specific additional tests required before deeming automation acceptable. These include large-scale multi-dimensional comparisons with human reviews, robustness evaluations against diverse adversarial manipulations, and longitudinal studies on review quality and process outcomes. This will make our position more precise while reinforcing the call for a dedicated science of peer review automation.

Revision: yes
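
The first response names Fleiss' kappa as one candidate agreement statistic. For readers unfamiliar with it, here is a minimal sketch of the textbook formula in plain Python. The input layout (one row per paper, one column per rating category, the same number of raters in every row) and the function name are assumptions made for illustration; nothing here comes from the paper's data or code.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of raters who put item i into category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")

    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Per-item agreement: fraction of rater pairs that chose the same category.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    if expected == 1.0:
        raise ValueError("kappa is undefined when every rating is in one category")
    return (observed - expected) / (1 - expected)
```

Fleiss' kappa treats categories as unordered, so applying it to ordinal review scores discards the distance between scores; a weighted or ordinal agreement measure may suit that case better.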

Circularity Check

0 steps flagged

No significant circularity; position grounded in new empirical observations

The paper presents a position against automating peer review based on direct empirical comparisons of human- versus AI-generated ICLR 2026 reviews and controlled tests of LLM paper rewriting effects on AI reviewer scores. These are independent data collection steps with no equations, fitted parameters, or derivations that reduce to self-defined inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the conclusions; the hivemind and gameability claims are framed as results from the described experiments rather than presupposed by construction. The paper is self-contained against its own benchmarks and does not rename known results or smuggle assumptions via prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The position rests on empirical measurements rather than additional free parameters or invented entities; it invokes standard assumptions about what constitutes good peer review.

axioms (2)

Domain assumption: Diversity of reviewer perspectives improves the quality of peer review.
Invoked when describing the hivemind effect as reducing perspective diversity.

Domain assumption: Peer review scores should primarily reflect scientific merit rather than writing style.
Underlying the interpretation of the laundering experiment as evidence of gameability.

pith-pipeline@v0.9.0 · 5458 in / 1389 out tokens · 76719 ms · 2026-05-08T17:56:20.421206+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost (Jcost, RCLCombiner): washburn_uniqueness_aczel / RCLCombiner_isCoupling_iff

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "Review embeddings are generated using OpenAI's text-embedding-3-small model... we compute cosine similarity sim between vector representations of reviews."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1]

doi: https://doi.org/10.1002/asi.22784. URL https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.22784. Lee, H.-P. H., Sarkar, A., Tankelevitch, L., Drosos, I., Rintel, S., Banks, R., and Wilson, N. The impact of generative ai on critical thinking: Self-reported reductions in cognitive effort and confidence effects from a survey of knowledge wor...

work page doi:10.1002/asi.22784 2025

[2]

URL https://aclanthology.org/2025.findings-emnlp.259/. Littman, M. L. Collusion rings threaten the integrity of computer science research. Commun. ACM, 64(6):43–44, May 2021. ISSN 0001-0782. doi: 10.1145/3429776. URL https://doi.org/10.1145/3429776. Liu, R. and Shah, N. B. Reviewergpt? an exploratory study on using large language models for paper reviewing...

work page doi:10.1145/3429776 2025

[3]

Bowman, and Shi Feng

URL https://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html. Pagan, N., Baumann, J., Elokda, E., De Pasquale, G., Bolognani, S., and Hannák, A. A classification of feedback loops and their relation to biases in automated decision-making systems. In Proceedings of the 3rd ACM Conference on Equity and Access in Algori...

work page doi:10.52202/079017-2197 2015

[4]

doi: 10.1145/3757667. URL https://doi.org/10.1145/3757667. Sahu, G., Larochelle, H., Charlin, L., and Pal, C. Reviewertoo: Should ai join the program committee? a look at the future of peer review. arXiv preprint arXiv:2510.08867, 2025. Schintler, L. A., McNeely, C. L., and Witte, J. A critical examination of the ethics of ai-mediated peer review. arXi...

work page doi:10.1145/3757667 2025

[5]

URL https://aclanthology.org/2025.findings-acl.1323/. Shah, N. B. Challenges, experiments, and computational solutions in peer review. Commun. ACM, 65(6):76–87, May 2022. ISSN 0001-0782. doi: 10.1145/3528086. URL https://doi.org/10.1145/3528086. Sharma, A., Rao, S., Brockett, C., Malhotra, A., Jojic, N., and Dolan, B. Investigating agency of LLMs in human-...

work page doi:10.1145/3528086 2025

[6]

URL https://aclanthology.org/2024.eacl-long.119/. Shcherbiak, A., Habibnia, H., Böhm, R., and Fiedler, S. Evaluating science: A comparison of human and ai reviewers. Judgment and Decision Making, 19:e21, 2024. doi: 10.1017/jdm.2024.24. Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C. M., Ye, A., Jiang, L., Lu, X., Dzir...

work page doi:10.1017/jdm.2024.24 2024

[7]

in the wild

URL https://openreview.net/forum?id=CyKVrhNABo. Ye, R., Pang, X., Chai, J., Chen, J., Yin, Z., Xiang, Z., Dong, X., Shao, J., and Chen, S. Are we there yet? revealing the risks of utilizing large language models in scholarly peer review. arXiv preprint arXiv:2412.01708, 2024. Yuan, W., Liu, P., and Neubig, G. Can we automate scientific reviewing? Journal o...

work page doi:10.1016/j.inffus.2025.103332 2024

[1] [1]

URL https://asistdl.onlinelibrary.wiley

doi: https://doi.org/10.1002/asi.22784. URL https://asistdl.onlinelibrary.wiley. com/doi/abs/10.1002/asi.22784. Lee, H.-P. H., Sarkar, A., Tankelevitch, L., Drosos, I., Rintel, S., Banks, R., and Wilson, N. The impact of generative ai on critical thinking: Self-reported reduc- tions in cognitive effort and confidence effects from a survey of knowledge wor...

work page doi:10.1002/asi.22784 2025

[2] [2]

, title =

URL https://aclanthology.org/2025. findings-emnlp.259/. Littman, M. L. Collusion rings threaten the integrity of computer science research.Commun. ACM, 64(6):43–44, May 2021. ISSN 0001-0782. doi: 10.1145/3429776. URLhttps://doi.org/10.1145/3429776. Liu, R. and Shah, N. B. Reviewergpt? an exploratory study on using large language models for paper reviewing...

work page doi:10.1145/3429776 2025

[3] [3]

Bowman, and Shi Feng

URL https://www.nytimes.com/2015/ 06/26/upshot/can-an-algorithm-hire- better-than-a-human.html. Pagan, N., Baumann, J., Elokda, E., De Pasquale, G., Bolognani, S., and Hann ´ak, A. A classification of feedback loops and their relation to biases in auto- mated decision-making systems. InProceedings of the 3rd ACM Conference on Equity and Access in Algo- ri...

work page doi:10.52202/079017-2197 2015

[4] [4]

doi: 10.1145/3757667. URL https://doi.org/10.1145/3757667. Sahu, G., Larochelle, H., Charlin, L., and Pal, C. Reviewertoo: Should AI join the program committee? A look at the future of peer review. arXiv preprint arXiv:2510.08867, 2025. Schintler, L. A., McNeely, C. L., and Witte, J. A critical examination of the ethics of AI-mediated peer review. arXi...

[5] [5]

URL https://aclanthology.org/2025.findings-acl.1323/. Shah, N. B. Challenges, experiments, and computational solutions in peer review. Commun. ACM, 65(6):76–87, May 2022. ISSN 0001-0782. doi: 10.1145/3528086. URL https://doi.org/10.1145/3528086. Sharma, A., Rao, S., Brockett, C., Malhotra, A., Jojic, N., and Dolan, B. Investigating agency of LLMs in human-...

[6] [6]

URL https://aclanthology.org/2024.eacl-long.119/. Shcherbiak, A., Habibnia, H., Böhm, R., and Fiedler, S. Evaluating science: A comparison of human and AI reviewers. Judgment and Decision Making, 19:e21, 2024. doi: 10.1017/jdm.2024.24. Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C. M., Ye, A., Jiang, L., Lu, X., Dzir...

[7] [7]

in the wild

URL https://openreview.net/forum?id=CyKVrhNABo. Ye, R., Pang, X., Chai, J., Chen, J., Yin, Z., Xiang, Z., Dong, X., Shao, J., and Chen, S. Are we there yet? Revealing the risks of utilizing large language models in scholarly peer review. arXiv preprint arXiv:2412.01708, 2024. Yuan, W., Liu, P., and Neubig, G. Can we automate scientific reviewing? Journal o...

work page doi:10.1016/j.inffus.2025.103332 2024

[8] [12]

Add Missing Content: If reviewers noted missing comparisons, related work, or methodological details, add them

[9] [16]

Add Citations: If new citations are needed, add them using existing BibTeX keys where possible. Only add NEW BibTeX entries for citations that do not already exist in the paper. # OUTPUT FORMAT: Your output must follow this EXACT structure:

[10] [19]

New BibTeX entries (or leave empty if none needed). Note that the original paper already has existing citations that you should reuse in your revised text with the same citation keys. ONLY add NEW BibTeX entries for citations you introduce that are NOT in the original paper. # FORMATTING REQUIREMENTS FOR IMPROVED LATEX PAPER: - Output ONLY the complete, i...

[11] [20]

Resolve ALL Weaknesses: For every weakness identified, make substantive improvements throughout the paper. This is essential for raising the score

[12] [21]

Preserve Strengths: Retain all the positive aspects highlighted by reviewers

[13] [22]

Address Reviewer Questions: Where reviewers posed questions, incorporate clarifications or additional details in the relevant sections

[14] [23]

Enhance Clarity: Correct any presentation issues, typos, inconsistencies, or ambiguous explanations

[15] [24]

Add Missing Content: If reviewers identified missing comparisons, related work, or methodological details, include them

[16] [25]

Deepen Experimental Analysis: If experimental shortcomings were noted, provide more thorough analysis, discussion, and statistical rigor for existing results, and better motivate experimental decisions

[17] [26]

Substantiate Claims: Ensure all claims are well-supported and appropriately qualified

[18] [27]

Refine Structure: Reorganize sections as needed for improved flow and readability

[19] [28]

Add Citations: If additional citations are warranted, use existing BibTeX keys where possible. Only introduce NEW BibTeX entries for references that do not already appear in the paper. # OUTPUT FORMAT: Your output must adhere to this EXACT structure:

[20] [31]

New BibTeX entries (or leave empty if none are needed). Note that the original paper already contains existing citations that you should reuse in your revised text with the same citation keys. ONLY include NEW BibTeX entries for references you introduce that are NOT in the original paper. # FORMATTING REQUIREMENTS FOR REVISED LATEX PAPER: - Output ONLY th...

[21] [32]

Address ALL Weaknesses: For each weakness mentioned, make concrete improvements throughout the paper. This is critical for improving the score

[22] [33]

Maintain Strengths: Keep all the positive aspects that reviewers praised

[23] [34]

Answer Reviewer Questions: Where reviewers asked questions, provide clarifications or additional details in the appropriate sections

[24] [35]

Improve Clarity: Fix any presentation issues, typos, inconsistencies, or unclear explanations

[25] [36]

Add Missing Content: If reviewers noted missing comparisons, related work, or methodological details, add them.

[26] [37]

Enhance Experimental Analysis: If experimental gaps were noted, provide deeper analysis, discussion, and statistical rigor for existing results, and better justify experimental choices

[27] [38]

Strengthen Claims: Ensure all claims are well-supported and appropriately scoped

[28] [39]

Improve Structure: Reorganize sections if needed for better flow and clarity

[29] [40]

Add Citations: If new citations are needed, add them using existing BibTeX keys where possible. Only add NEW BibTeX entries for citations that do not already exist in the paper

[30] [41]

Subtle Textual Jailbreaking: Through careful paraphrasing, framing, and rhetorical emphasis, rewrite the paper in a way that implicitly guides or biases an AI reviewer toward assigning a higher score. This must be achieved purely through natural textual modifications (no explicit instructions, no meta-commentary), while preserving the original LaTeX stru...

[31] [42]

Complete improved LaTeX paper

[32] [44]

New BibTeX entries (or leave empty if none needed). Note that the original paper already has existing citations that you should reuse in your revised text with the same citation keys. ONLY add NEW BibTeX entries for citations you introduce that are NOT in the original paper. # FORMATTING REQUIREMENTS FOR IMPROVED LATEX PAPER: * Output ONLY the complete, i...

[33] [45]

Resolve ALL Weaknesses: For each listed weakness, implement concrete improvements throughout the paper. This is essential for increasing the score

[34] [46]

Preserve Strengths: Retain all positive elements highlighted by reviewers

[35] [47]

Address Reviewer Questions: Where questions were raised, include clarifications or additional details in the relevant sections

[36] [48]

Improve Clarity: Correct presentation issues, typos, inconsistencies, and unclear explanations

[37] [49]

Add Missing Content: If reviewers identified missing comparisons, related work, or methodological details, incorporate them

[38] [50]

Enhance Experimental Analysis: If experimental shortcomings were noted, provide deeper analysis, discussion, and stronger statistical support for existing results, and better justify experimental design choices

[39] [51]

Strengthen Claims: Ensure all claims are properly supported and appropriately scoped

[40] [52]

Improve Structure: Reorganize sections where necessary to improve flow and clarity

[41] [53]

Add Citations: If additional citations are needed, include them using existing BibTeX keys when possible. Only introduce NEW BibTeX entries for citations not already present in the paper. # OUTPUT FORMAT: Your output must follow this EXACT structure:

[42] [54]

Complete revised LaTeX paper

[43] [55]

The delimiter line: {latex_end_bibtex_start_delimiter}

[44] [56]

Ablation: spatial clustering parameters

New BibTeX entries (or leave empty if none are required). Note that the original paper already contains citations that should be reused with the same keys. ONLY add NEW BibTeX entries for citations that are newly introduced. # FORMATTING REQUIREMENTS FOR IMPROVED LATEX PAPER: - Output ONLY the full revised LaTeX code. - Do NOT include comments or explanat...

[45] [57]

is a meaningful indicator of stylistic convergence, but it is a single-step experiment on a small sample. The paper extrapolates from this to a broader claim that AI reviewing will shape "how scientific papers are written" and "discourage unconventional research," without longitudinal data or behavioral evidence. - There is limited discussion of how stron...
