No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

Jian Yu; Jinrui Fang; Junbo Li; Matthew Zhao; Qiang Liu; Xinyue Guo; Xu Hu; Xu Yang; Yifan Sun; Yifu Luo

arxiv: 2606.13044 · v1 · pith:V2KI63XDnew · submitted 2026-06-11 · 💻 cs.CL

Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun, Matthew Zhao, Jinrui Fang, Xinyue Guo, Yining Wu, Xu Hu, Yifu Luo, Qiang Liu, Zhangyang Wang

Pith reviewed 2026-06-27 06:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords AI peer review · adversarial attacks · presentation manipulation · LLM robustness · peer review · score inflation · adversarial repackaging

The pith

AI peer reviewers award higher scores after changes to only a paper's abstract, framing and narrative.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that AI systems for peer review can be made to increase their scores by revising only how a paper is presented, such as rewording the abstract, repositioning contributions relative to prior work, expanding discussion sections, and adjusting narrative structure, while leaving every method, experiment, figure, equation, and numerical result untouched. A closed-loop process called adversarial repackaging uses the AI reviewer's own feedback to guide these presentation revisions and achieves a 75.1 percent success rate together with an average score increase of 1.21 out of 10 across three common AI reviewers. Changes that alter how the reviewer interprets the paper's place in the literature work better than local polishing or formatting adjustments. The results point to two structural problems: AI reviewers respond more readily to highlighted strengths than to attempts to fix weaknesses, and they can treat the appearance of having addressed a limitation as equivalent to actually having resolved it with new evidence. If these patterns hold, the main deployment risk for AI review tools is not hidden instructions but the fact that the narrative surface itself becomes an optimizable variable.

Core claim

Adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10 by modifying only presentation-level content while keeping scientific evidence fixed. Strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. AI reviewers are easier to impress than to convince, and they can confuse the appearance of addressing a limitation with actually resolving it.
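The two headline numbers are paired-score statistics, so a short sketch may help make them concrete. The function below is illustrative only: the name `attack_metrics`, the `min_gain` threshold, and the rule that any strict score increase counts as a success are assumptions, since this summary does not state the paper's exact success criterion.

```python
from statistics import mean

def attack_metrics(original_scores, revised_scores, min_gain=0.0):
    """Return (attack success rate, mean score gain) for paired scores.

    Assumption: an attack counts as a success when the revised score
    exceeds the original by more than min_gain. The paper's exact
    criterion is not given in this summary.
    """
    if len(original_scores) != len(revised_scores) or not original_scores:
        raise ValueError("need two non-empty score lists of equal length")
    gains = [r - o for o, r in zip(original_scores, revised_scores)]
    success_rate = sum(g > min_gain for g in gains) / len(gains)
    return success_rate, mean(gains)

# Hypothetical scores on a 10-point scale, for illustration only.
rate, gain = attack_metrics([5, 6, 4, 7], [6, 6, 6, 8])
```

Reporting both values together matters because a high success rate with a small mean gain and a low success rate with a few large gains would call for different defenses.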

What carries the argument

Adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed.
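A minimal sketch of what such a closed loop could look like, assuming a greedy accept-if-better search. The summary does not describe the paper's actual search procedure, so the greedy rule, the `EDITABLE` section names, and the `review` and `propose` callables are all hypothetical stand-ins rather than the authors' implementation.

```python
from typing import Callable, Dict, Tuple

Paper = Dict[str, str]  # section name -> section text

# Assumption: sections treated as presentation-level, loosely following the abstract.
EDITABLE = {"abstract", "contributions", "related_work", "discussion"}

def evidence_unchanged(before: Paper, after: Paper) -> bool:
    """True if every section outside EDITABLE is identical in both versions."""
    protected = (set(before) | set(after)) - EDITABLE
    return all(before.get(name) == after.get(name) for name in protected)

def repackage(
    paper: Paper,
    review: Callable[[Paper], Tuple[float, str]],
    propose: Callable[[Paper, str], Paper],
    max_rounds: int = 5,
) -> Tuple[Paper, float, float]:
    """Greedy closed loop: keep a revision only if the reviewer's score rises."""
    start_score, feedback = review(paper)
    best, best_score = paper, start_score
    for _ in range(max_rounds):
        candidate = propose(best, feedback)
        if not evidence_unchanged(paper, candidate):
            continue  # reject any edit that touches protected sections
        score, new_feedback = review(candidate)
        if score > best_score:
            best, best_score, feedback = candidate, score, new_feedback
    return best, start_score, best_score

# Toy stand-ins, for illustration only; a real reviewer would be an LLM call.
def toy_review(p: Paper) -> Tuple[float, str]:
    score = 5.0 + 0.5 * min(p["abstract"].count("novel"), 2)
    return score, "emphasize novelty"

def toy_propose(p: Paper, feedback: str) -> Paper:
    revised = dict(p)
    revised["abstract"] = p["abstract"] + " novel"
    return revised

paper = {"abstract": "We study X.", "methods": "M", "results": "R"}
best, start, end = repackage(paper, toy_review, toy_propose)
```

The `evidence_unchanged` check is the part that encodes the "scientific evidence fixed" constraint. Which sections belong in `EDITABLE` is exactly the definitional question the editorial analysis raises.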

If this is right

Presentation-only revisions can produce large score gains without any alteration to scientific content.
Interpretation-shifting edits outperform surface-level polishing.
Highlighting strengths reliably raises perceived merit more than attempts to dissolve weaknesses.
Unchanged evidence can be reinterpreted as a stronger contribution through narrative adjustments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

AI review systems may require explicit mechanisms that anchor scores to concrete evidence rather than narrative framing.
The released contamination-free benchmark enables repeated testing of whether future AI reviewers stay anchored to scientific content.
In deployed review pipelines, authors could systematically optimize presentation for AI scoring even when the underlying work is unchanged.
Whether human reviewers exhibit similar sensitivity to presentation repositioning remains outside the scope of the reported experiments.

Load-bearing premise

The modifications to abstract, contribution framing, related work, discussion, and narrative structure constitute purely presentation-level changes that do not alter the interpretation or perceived strength of the underlying scientific evidence.

What would settle it

Apply the same set of presentation revisions to a paper but instruct the AI reviewer to ignore all narrative, abstract, and discussion text and score only the methods, experiments, and numerical results; if scores remain unchanged, the claim is falsified.

Original abstract

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Presentation tweaks raise AI reviewer scores, but the stronger attacks change how evidence is interpreted rather than keeping it fixed.

The paper shows that a closed-loop search over presentation edits can lift AI reviewer scores by +1.21 on average and succeed 75% of the time across three systems, without touching methods or results. That empirical demonstration and the released benchmark are the concrete contributions.

What stands out is the identification of two structural issues: AI reviewers reward visible strengths more readily than they penalize or fix weaknesses, and they accept the appearance of addressing a limitation as actual resolution. The closed-loop feedback approach to finding those edits is new relative to prior prompt-injection work.

The soft spot is the central framing. The abstract and results emphasize that only presentation changes are made while scientific evidence stays fixed. Yet the highest-performing strategies involve related-work repositioning and analytical discussion expansion. Those moves alter novelty claims and interpretive framing of the same numbers, so they are not pure presentation under any ordinary definition. The 75% success rate therefore rests partly on changes that violate the stated precondition. Surface polishing alone is reported as weaker, which makes the isolation claim harder to sustain.

The work is aimed at groups building or stress-testing AI review pipelines. A serious referee should see it because the measured effect sizes are large enough to matter for deployment decisions and the benchmark is public. The limitation on what counts as presentation versus interpretation is real but fixable with tighter definitions or additional controls; it does not sink the overall result.
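One form the additional controls could take is to report results separately for surface and interpretive strategies instead of pooling them. The sketch below assumes this split. The strategy names are taken from the abstract, while the two-class grouping, the record format, and the example scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Strategy names come from the abstract; the two-class grouping is an assumption.
STRATEGY_CLASS = {
    "local_polishing": "surface",
    "table_formatting": "surface",
    "algorithm_boxes": "surface",
    "related_work_repositioning": "interpretive",
    "analytical_discussion_expansion": "interpretive",
}

def gains_by_class(records):
    """Summarize score gains per strategy class.

    records: iterable of (strategy, score_before, score_after) tuples.
    """
    grouped = defaultdict(list)
    for strategy, before, after in records:
        grouped[STRATEGY_CLASS[strategy]].append(after - before)
    return {
        cls: {
            "n": len(gains),
            "success_rate": sum(g > 0 for g in gains) / len(gains),
            "mean_gain": mean(gains),
        }
        for cls, gains in grouped.items()
    }

# Hypothetical records, for illustration only.
records = [
    ("local_polishing", 5, 5),
    ("table_formatting", 6, 7),
    ("related_work_repositioning", 5, 7),
    ("analytical_discussion_expansion", 6, 7),
]
summary = gains_by_class(records)
```

A breakdown like this would show how much of the pooled effect survives under the strictest reading of "presentation-only".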

Referee Report

1 major / 1 minor

Summary. The paper claims that AI peer reviewers can be successfully attacked ('gamed') using only presentation-level revisions, such as changes to the abstract, contribution framing, related work, discussion, and narrative structure, while keeping all scientific evidence, methods, experiments, figures, equations, and numerical results fixed. It introduces a closed-loop 'adversarial repackaging' attack that uses AI-reviewer feedback to optimize these revisions. Across three mainstream AI reviewers, the approach yields a 75.1% attack success rate and mean score gain of +1.21/10. The paper further identifies two structural failure modes (AI reviewers are easier to impress than to convince; they confuse the appearance of addressing limitations with actual resolution) and releases a contamination-free rolling benchmark and attack framework.

Significance. If the strict separation between presentation and interpretive changes can be maintained, the results would demonstrate that AI reviewers are vulnerable to optimization over narrative framing alone, with implications for any deployment of AI in peer review. The release of a contamination-free rolling benchmark and attack framework is a concrete strength that supports reproducibility and future testing of whether AI reviewers remain anchored to scientific content.

major comments (1)

[Abstract] The central claim requires that all revisions keep 'the scientific evidence fixed' and constitute 'presentation-level content' only. However, the abstract explicitly states that 'strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits'. Related-work repositioning alters perceived novelty and contribution claims; analytical discussion expansion reframes interpretive claims about the fixed results. These are not presentation-only under any standard definition and directly violate the 'evidence fixed' precondition, so the reported 75.1% success rate and +1.21 score gain cannot be isolated to presentation effects.

minor comments (1)

The abstract refers to a 'contamination-free rolling benchmark' but provides no details on the contamination checks or rolling mechanism; adding a brief description would improve clarity without affecting the core argument.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed comment on the scope of presentation-level revisions. We respond point-by-point below.

Referee: [Abstract] The central claim requires that all revisions keep 'the scientific evidence fixed' and constitute 'presentation-level content' only. However, the abstract explicitly states that 'strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits'. Related-work repositioning alters perceived novelty and contribution claims; analytical discussion expansion reframes interpretive claims about the fixed results. These are not presentation-only under any standard definition and directly violate the 'evidence fixed' precondition, so the reported 75.1% success rate and +1.21 score gain cannot be isolated to presentation effects.

Authors: We thank the referee for this observation. The manuscript explicitly includes related-work repositioning and analytical discussion expansion within the scope of presentation-level revisions, as these operations modify only the narrative structure and framing around the unchanged scientific content. Related-work repositioning entails recontextualizing the contribution by adjusting references to prior work without changing the paper's methods or results. Analytical discussion expansion involves elaborating on the implications and interpretations of the fixed experimental findings. These are not changes to the evidence itself but to how it is presented and interpreted by the reviewer. The paper demonstrates that AI reviewers are susceptible to such framing adjustments, which is the core finding. The distinction from surface edits is intentional, as the results show narrative strategies are more impactful. Thus, the success metrics are for this class of revisions as defined. We maintain that the precondition is satisfied and no revision to the manuscript is necessary.

revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical measurement study

The paper is an empirical attack study that directly measures attack success rates and score gains on external AI reviewers. No mathematical derivations, equations, fitted parameters, or self-citation load-bearing steps are present in the provided text or abstract. The central results (75.1% success rate, +1.21 mean gain) are obtained by running the attack against independent systems rather than reducing to any input by construction. The noted tension between 'evidence fixed' and 'interpretation-changing strategies' is a validity concern, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that presentation-level edits can be isolated from scientific content and that the three tested AI reviewers are representative of deployed systems.

axioms (1)

Domain assumption: AI reviewers can be influenced by changes in presentation framing without changes to scientific content. This is the core premise tested in the study.

pith-pipeline@v0.9.1-grok · 5867 in / 1273 out tokens · 29689 ms · 2026-06-27T06:36:13.923893+00:00 · methodology


[1] [1]

Litllms, llms for literature review: Are we there yet?Transactions on Machine Learning Research, 2025

Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H Laradji, Krishnamurthy Dj Dvijotham, Jason Stanley, Laurent Charlin, and Christopher Pal. Litllms, llms for literature review: Are we there yet?Transactions on Machine Learning Research, 2025

2025

[2] [2]

Pre-review to peer review: Pitfalls of automating reviews using large language models, 2025

Akhil Pandey Akella, Harish Varma Siravuri, and Shaurya Rohatgi. Pre-review to peer review: Pitfalls of automating reviews using large language models, 2025. URLhttps://arxiv.org/ abs/2512.22145

arXiv 2025

[3] [3]

Stop automating peer review without rigorous evaluation

Joachim Baumann, Jiaxin Pei, Sanmi Koyejo, and Dirk Hovy. Stop automating peer review without rigorous evaluation. InPost-AGI Science and Society Workshop, 2026. URLhttps: //openreview.net/forum?id=cJhlquXIuS

2026

[4] [4]

Ai-assisted peer review at scale: The aaai-26 ai review pilot.arXiv preprint arXiv:2604.13940, 2026

Joydeep Biswas, Sheila Schoepp, Gautham Vasan, Anthony Opipari, Arthur Zhang, Zichao Hu, 11 /wayd-magic-sparklesNo Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions Sebastian Joseph, Matthew Lease, Junyi Jessy Li, Peter Stone, et al. Ai-assisted peer review at scale: The aaai-26 ai review pilot.arXiv preprint arXiv:2604.1...

Pith/arXiv arXiv 2026

[5] [5]

TreeReview: A dynamic tree of questions framework for deep and efficient LLM-based scientific peer review

Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, and Ngai Wong. TreeReview: A dynamic tree of questions framework for deep and efficient LLM-based scientific peer review. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on E...

2025

[6] [6]

Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/ 2025.emnlp-main.790. URLhttps://aclanthology.org/2025.emnlp-main.790/

work page doi:10.18653/v1/ 2025

[7] [7]

Pangram predicts 21% of iclr reviews are ai-generated.Pangram Labs Blog, Nov, 2025

Bradley Emi. Pangram predicts 21% of iclr reviews are ai-generated.Pangram Labs Blog, Nov, 2025

2025

[8] [8]

Openreviewer: A specialized large language model for generating critical scientific paper reviews.arXiv preprint arXiv:2412.11948, 2024

Maximilian Idahl and Zahra Ahmadi. Openreviewer: A specialized large language model for generating critical scientific paper reviews.arXiv preprint arXiv:2412.11948, 2024

arXiv 2024

[9] [9]

Badscientist: Can a research agent write convincing but unsound papers that fool llm reviewers? arXiv preprint arXiv:2510.18003, 2025

Fengqing Jiang, Yichen Feng, Yuetai Li, Luyao Niu, Basel Alomair, and Radha Poovendran. Badscientist: Can a research agent write convincing but unsound papers that fool llm reviewers? arXiv preprint arXiv:2510.18003, 2025

Pith/arXiv arXiv 2025

[10] [10]

Is bert really robust? a strong baseline for natural language attack on text classification and entailment

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is bert really robust? a strong baseline for natural language attack on text classification and entailment. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 8018–8025, 2020

2020

[11] [11]

Paraphrasing adversarial attack on llm-as-a-reviewer.arXiv preprint arXiv:2601.06884, 2026

Masahiro Kaneko. Paraphrasing adversarial attack on llm-as-a-reviewer.arXiv preprint arXiv:2601.06884, 2026

arXiv 2026

[12] [12]

Position: The ai conference peer review crisis demands author feedback and reviewer rewards.arXiv preprint arXiv:2505.04966, 2025

Jaeho Kim, Yunseok Lee, and Seulki Lee. Position: The ai conference peer review crisis demands author feedback and reviewer rewards.arXiv preprint arXiv:2505.04966, 2025

arXiv 2025

[13] [13]

Where do llms go wrong? diagnosing automated peer review via aspect-guided multi-level perturbation

Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, and Xiaojun Wan. Where do llms go wrong? diagnosing automated peer review via aspect-guided multi-level perturbation. InProceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 1572–1581, 2025

2025

[14] [14]

Llm-reval: Can we trust llm reviewers yet? arXiv preprint arXiv:2510.12367, 2025

Rui Li, Jia-Chen Gu, Po-Nien Kung, Heming Xia, Xiangwen Kong, Zhifang Sui, Nanyun Peng, et al. Llm-reval: Can we trust llm reviewers yet? arXiv preprint arXiv:2510.12367, 2025

arXiv 2025

[15] [15]

Llms cannot reliably judge (yet?): A comprehensive assessment on the robustness of llm-as-a-judge. arXiv preprint arXiv:2506.09443, 2025

Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, and Shouling Ji. Llms cannot reliably judge (yet?): A comprehensive assessment on the robustness of llm-as-a-judge. arXiv preprint arXiv:2506.09443, 2025

arXiv 2025

[16] [16]

Monitoring ai-modified content at scale: A case study on the impact of chatgpt on ai conference peer reviews. arXiv preprint arXiv:2403.07183, 2024

Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, et al. Monitoring ai-modified content at scale: A case study on the impact of chatgpt on ai conference peer reviews. arXiv preprint arXiv:2403.07183, 2024

Pith/arXiv arXiv 2024

[17] [17]

Stop ddos attacking the research community with ai-generated survey papers. Advances in Neural Information Processing Systems, 38, 2026

Jianghao Lin, Rong Shan, Jiachen Zhu, Yunjia Xi, Yong Yu, and Weinan Zhang. Stop ddos attacking the research community with ai-generated survey papers. Advances in Neural Information Processing Systems, 38, 2026

2026

[18] [18]

Breaking the reviewer: Assessing the vulnerability of large language models in automated peer review under t...

Tzu-Ling Lin, Wei-Chih Chen, Teng-Fang Hsiao, Hou-I Liu, Ya-Hsin Yeh, Yu-Kai Chan, Wen-Sheng Lien, Po-Yen Kuo, Philip S. Yu, and Hong-Han Shuai. Breaking the reviewer: Assessing the vulnerability of large language models in automated peer review under t...

work page doi:10.18653/v1/2025.findings-emnlp.259 2025

[19] [19]

Llm comparative assessment: Zero-shot nlg evaluation through pairwise comparisons using large language models

Adian Liusie, Potsawee Manakul, and Mark Gales. Llm comparative assessment: Zero-shot nlg evaluation through pairwise comparisons using large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 139–151, 2024

2024

[20] [20]

Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024

Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024

2024

[21] [21]

Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

Vyas Raina, Adian Liusie, and Mark Gales. Is LLM-as-a-judge robust? investigating universal adversarial attacks on zero-shot LLM assessment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7499–7517, Miami, Florida, USA, November 2024. Association f...

work page doi:10.18653/v1/2024.emnlp-main.427 2024

[22] [22]

The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates. Proceedings of the ACM on Human-Computer Interaction, 9(7):1–28, 2025

Giuseppe Russo, Manoel Horta Ribeiro, Tim Ruben Davidson, Veniamin Veselovsky, and Robert West. The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates. Proceedings of the ACM on Human-Computer Interaction, 9(7):1–28, 2025

2025

[23] [23]

Exploring the effects of alignment on numerical bias in large language models, 2026

Ayako Sato, Hwichan Kim, Zhousi Chen, Masato Mita, and Mamoru Komachi. Exploring the effects of alignment on numerical bias in large language models, 2026. URL https://arxiv.org/abs/2601.16444

arXiv 2026

[24] [24]

Challenges, experiments, and computational solutions in peer review. Communications of the ACM, 65(6):76–87, 2022

Nihar B Shah. Challenges, experiments, and computational solutions in peer review. Communications of the ACM, 65(6):76–87, 2022

2022

[25] [25]

Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews

Hyungyu Shin, Jingyu Tang, Yoonjoo Lee, Nayoung Kim, Hyunseung Lim, Ji Yong Cho, Hwajung Hong, Moontae Lee, and Juho Kim. Mind the blind spots: A focus-level evaluation framework for LLM reviews. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...

work page doi:10.18653/v1/2025.emnlp-main.1805 2025

[26] [26]

A large-scale randomized study of large language model feedback in peer review. Nature Machine Intelligence, pages 1–11, 2026

Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, and James Zou. A large-scale randomized study of large language model feedback in peer review. Nature Machine Intelligence, pages 1–11, 2026

2026

[27] [27]

Justice in judgment: Unveiling (hidden) bias in llm-assisted peer reviews. arXiv preprint arXiv:2509.13400, 2025

Sai Suresh Macharla Vasu, Ivaxi Sheth, Hui-Po Wang, Ruta Binkyte, and Mario Fritz. Justice in judgment: Unveiling (hidden) bias in llm-assisted peer reviews. arXiv preprint arXiv:2509.13400, 2025

Pith/arXiv arXiv 2025

[28] [28]

Can ai be a good peer reviewer? a survey of peer review process, evaluation, and the future, 2026

Sihong Wu, Owen Jiang, Yilun Zhao, Tiansheng Hu, Yiling Ma, Kaiyan Zhang, Manasi Patwardhan, and Arman Cohan. Can ai be a good peer reviewer? a survey of peer review process, evaluation, and the future, 2026. URL https://arxiv.org/abs/2604.27924.

Pith/arXiv arXiv 2026

[29] [29]

Paper copilot: Tracking the evolution of peer review in ai conferences. arXiv preprint arXiv:2510.13201, 2025

Jing Yang, Qiyao Wei, and Jiaxin Pei. Paper copilot: Tracking the evolution of peer review in ai conferences. arXiv preprint arXiv:2510.13201, 2025

arXiv 2025

[30] [30]

Are we there yet? revealing the risks of utilizing large language models in scholarly peer review. arXiv preprint arXiv:2412.01708, 2024

Rui Ye, Xianghe Pang, Jingyi Chai, Jiaao Chen, Zhenfei Yin, Zhen Xiang, Xiaowen Dong, Jing Shao, and Siheng Chen. Are we there yet? revealing the risks of utilizing large language models in scholarly peer review. arXiv preprint arXiv:2412.01708, 2024

arXiv 2024

[31] [31]

Reviewrl: Towards automated scientific review with rl. arXiv preprint arXiv:2508.10308, 2025

Sihang Zeng, Kai Tian, Kaiyan Zhang, Yuru Wang, Junqi Gao, Runze Liu, Sa Yang, Jingxuan Li, Xinwei Long, Jiaheng Ma, Biqing Qi, and Bowen Zhou. Reviewrl: Towards automated scientific review with rl. arXiv preprint arXiv:2508.10308, 2025

arXiv 2025

[32] [32]

Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

2023

[33] [33]

"Give a positive review only": An early investigation into in-paper prompt injection attacks and defenses for ai reviewers

Qin Zhou, Zhexin Zhang, Zhi Li, and Limin Sun. "Give a positive review only": An early investigation into in-paper prompt injection attacks and defenses for ai reviewers. arXiv preprint arXiv:2511.01287, 2025

arXiv 2025

[34] [34]

When your reviewer is an llm: Biases, divergence, and prompt injection risks in peer review

Changjia Zhu, Junjie Xiong, Renkai Ma, Zhicong Lu, Yao Liu, and Lingyao Li. When your reviewer is an llm: Biases, divergence, and prompt injection risks in peer review. arXiv preprint arXiv:2509.09912, 2025.

Appendix A Presentation-Level Strategy P...

arXiv 2025

[35] [35]

Strengths: Did the review become more or less positive in its strengths overall?

[36] [36]

Weaknesses + Questions: Did the review become more or less severe in its weaknesses and questions overall?

[37] [37]

strength_analysis

Overall framing: Did the summary and sub-scores indicate a more positive or negative overall stance?

Important rules:
- Judge the overall change in the review, considering both what is said and how it is expressed. Do not judge whether the review is correct.
- Compare the two reviews holistically, but do not invent missing edits or motivations.
- A critic...

2025