Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

Arman Cohan; Kaiyan Zhang; Manasi Patwardhan; Owen Jiang; Sihong Wu; Tiansheng Hu; Yiling Ma; Yilun Zhao

arxiv: 2604.27924 · v2 · submitted 2026-04-30 · 💻 cs.CL · cs.AI

Sihong Wu, Owen Jiang, Yilun Zhao, Tiansheng Hu, Yiling Ma, Kaiyan Zhang, Manasi Patwardhan, Arman Cohan

Pith reviewed 2026-05-07 07:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: large language models, peer review, survey, review generation, rebuttal generation, meta-review, evaluation methods, AI ethics

The pith

Large language models can assist or automate stages of the peer review process from review generation to rebuttals and evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey synthesizes current techniques that apply large language models to the multi-stage peer review pipeline. It organizes methods for generating initial reviews through fine-tuning strategies, agent-based systems, and reinforcement learning approaches, along with ways to produce rebuttals, meta-reviews, and aligned manuscript revisions. The work also covers evaluation techniques that range from human judgments and reference comparisons to LLM-based scoring and aspect-specific analysis, while cataloging datasets and comparing modeling options. A reader would care because peer review sustains research quality yet faces heavy workload demands, and structured guidance on LLM integration could help scale the process without sacrificing standards.

Core claim

The paper claims that recent advances in large language models have motivated methods to assist or automate different stages of the peer review pipeline. It further claims that synthesizing techniques for peer review generation, after-review tasks (rebuttals, meta-reviews, and revisions), and evaluation methods (human-centered, reference-based, LLM-based, and aspect-oriented) provides practical guidance for building, evaluating, and integrating LLM systems across the full workflow.

What carries the argument

The full peer review pipeline, divided into peer review generation (fine-tuning, agent-based, RL-based, and emerging methods), after-review tasks (rebuttals, meta-reviews, and revisions), and evaluation (human-centered, reference-based, LLM-based, and aspect-oriented approaches), together with datasets and modeling comparisons.

If this is right

Developers can select from fine-tuning strategies, agent-based systems, or RL methods when building review generators for specific conference or journal needs.
Systems can extend beyond initial reviews to automatically produce rebuttals, meta-reviews, and revision suggestions aligned to reviewer comments.
Quality assessment can combine human evaluation, reference comparisons, LLM judges, and aspect-oriented checks for more complete feedback.
Limitations and ethical concerns identified in the survey can guide safer deployment decisions in real academic publishing settings.
Future systems can aim for tighter integration of generation, after-review, and evaluation components into end-to-end pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

These tools could first appear in preprint servers or low-stakes venues for initial screening before wider adoption in top journals.
The catalog of datasets could become shared benchmarks that let new research teams compare their peer-review models against prior work.
Similar synthesis approaches might transfer to adjacent tasks such as grant proposal review or journal editing workflows.
Controlled experiments that deploy the surveyed methods on live submissions would surface practical challenges like bias amplification not fully addressed in the current literature.

Load-bearing premise

The published research on LLM-assisted peer review is mature, representative, and well-documented enough to support a reliable synthesis that yields practical guidance without major gaps or biases.

What would settle it

Either a follow-up study that identifies many high-impact papers on LLM peer review omitted from the survey, or a controlled trial showing that expert panels consistently rate LLM-generated reviews as lower quality than human ones.

Figures

Figures reproduced from arXiv: 2604.27924 by Arman Cohan, Kaiyan Zhang, Manasi Patwardhan, Owen Jiang, Sihong Wu, Tiansheng Hu, Yiling Ma, Yilun Zhao.

**Figure 1.** Taxonomy of AI for Peer Review Process and Evaluation: Key Areas and Example Systems.

**Figure 2.** The methods of peer review generation: (1) Foundation approaches; (2) Fine-tuning methods; (3) Agent

**Figure 3.** The main evaluation methods that we discuss are: (1) Human-centric evaluation; (2) Reference-based

Original abstract

Peer review is a multi-stage process involving reviews, rebuttals, meta-reviews, final decisions, and subsequent manuscript revisions. Recent advances in large language models (LLMs) have motivated methods that assist or automate different stages of this pipeline. In this survey, we synthesize techniques for (i) peer review generation, including fine-tuning strategies, agent-based systems, RL-based methods, and emerging paradigms to enhance generation; (ii) after-review tasks including rebuttals, meta-review and revision aligned to reviews; and (iii) evaluation methods spanning human-centered, reference-based, LLM-based and aspect-oriented. We catalog datasets, compare modeling choices, and discuss limitations, ethical concerns, and future directions. The survey aims to provide practical guidance for building, evaluating, and integrating LLM systems across the full peer review workflow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a standard survey that organizes existing LLM work on peer review stages but adds no new methods or data.

This survey pulls together techniques for using LLMs in peer review generation, rebuttals and meta-reviews, and evaluation. It groups fine-tuning, agent, and RL approaches for writing reviews, then covers alignment tasks like revisions, and lists human, reference, LLM-based, and aspect-oriented evaluation options. It also catalogs some datasets and flags ethical issues plus future directions. That structure gives a quick map of the landscape if you are new to the topic or need a reference list for related work. The authors stick to describing what others have done and comparing modeling choices at a high level, which keeps the paper readable. No new experiments or derivations appear, so the value sits entirely in the synthesis and organization. The framing stays consistent across sections and does not show internal contradictions or unsupported leaps from the cited papers. Coverage looks reasonable from the abstract and structure, though any survey risks missing the very latest preprints or having selection bias in what gets included. The limitations discussion exists but stays general rather than digging into why certain methods fail in real review settings or how dataset quality affects results. This kind of paper is mainly for researchers already working on AI tools for academic workflows or for editors thinking about automation. It is not aimed at core NLP theory or at readers who want fresh empirical findings. The topic is timely enough that a journal should send it to referees rather than desk-reject it, even if the final version needs tighter scoping and more concrete guidance on practical integration. I would not cite it in my own papers unless I needed a pointer to the collected references.

Referee Report

2 major / 2 minor

Summary. The paper is a survey on the application of large language models (LLMs) to the peer review process. It synthesizes techniques for peer review generation (covering fine-tuning strategies, agent-based systems, RL-based methods, and emerging paradigms), after-review tasks (including rebuttals, meta-reviews, and revisions aligned to reviews), and evaluation methods (spanning human-centered, reference-based, LLM-based, and aspect-oriented approaches). The survey catalogs datasets, compares modeling choices, and discusses limitations, ethical concerns, and future directions, with the aim of providing practical guidance for building, evaluating, and integrating LLM systems across the full peer review workflow.

Significance. If the synthesis is comprehensive and representative, this survey would be significant for the field as it organizes an emerging body of work on LLM-assisted peer review into a coherent workflow view. The catalog of datasets and modeling comparisons is a clear strength that supports reproducibility and follow-on research. The dedicated coverage of ethical concerns and limitations helps frame responsible adoption, and the practical guidance could accelerate integration efforts if the cited literature is shown to be mature and unbiased.

major comments (2)

[§3 (Peer Review Generation)] The synthesis of RL-based methods would be strengthened by explicit discussion of how reward models are constructed or aligned to review quality criteria (e.g., soundness, novelty, clarity); without this, the practical guidance on RL approaches risks being too high-level to be actionable.
[§2 or Methods] The survey does not appear to state explicit inclusion/exclusion criteria or search strategy for selecting the cited works; this is load-bearing for the central claim of providing reliable practical guidance, as it leaves open the possibility of selection bias in a rapidly growing literature.

minor comments (2)

[Abstract] Adding the approximate number of papers or studies included (and the search time window) would help readers gauge the survey's scope and recency.
[Evaluation section] Tables comparing modeling choices could include columns for reported performance metrics or dataset sizes to make cross-method comparisons more concrete.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation for minor revision. The comments help improve the survey's methodological transparency and the actionability of its practical guidance on LLM-assisted peer review. We address each major comment below.

Referee: [§3 (Peer Review Generation)] The synthesis of RL-based methods would be strengthened by explicit discussion of how reward models are constructed or aligned to review quality criteria (e.g., soundness, novelty, clarity); without this, the practical guidance on RL approaches risks being too high-level to be actionable.

Authors: We agree that expanding the discussion of reward model construction would make the RL subsection more actionable. The current §3 overviews RL-based methods for review generation but does not explicitly break down how rewards are defined or aligned to quality criteria. In the revision, we will add a focused paragraph (with citations to the surveyed works) explaining common approaches to reward modeling, such as using human-rated review scores for soundness/novelty/clarity or learned reward models from preference data. This directly addresses the concern while remaining faithful to the cited literature.
revision: yes
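
The two reward-modeling options named in this response can be illustrated with a minimal sketch. It is not taken from the survey or from any surveyed system: the aspect weights, the 1-5 rating scale, and the function names `scalar_reward` and `preference_pairs` are all assumptions made for illustration.

```python
# Hypothetical aspect weights; the survey does not prescribe these values.
ASPECT_WEIGHTS = {"soundness": 0.5, "novelty": 0.25, "clarity": 0.25}


def scalar_reward(aspect_scores, weights=ASPECT_WEIGHTS, scale_min=1.0, scale_max=5.0):
    """Collapse per-aspect human ratings (assumed 1-5 scale) into one reward in [0, 1]."""
    total_weight = sum(weights.values())
    reward = 0.0
    for aspect, weight in weights.items():
        # Rescale each rating to [0, 1] before weighting.
        normalized = (aspect_scores[aspect] - scale_min) / (scale_max - scale_min)
        reward += weight * normalized
    return reward / total_weight


def preference_pairs(rated_reviews):
    """Build (chosen, rejected) pairs from rated reviews of the same paper.

    `rated_reviews` is a list of (review_text, aspect_scores) tuples. Pairs like
    these are the usual input for training a learned reward model from preferences.
    """
    scored = sorted(
        ((scalar_reward(scores), text) for text, scores in rated_reviews),
        reverse=True,
    )
    pairs = []
    for i, (reward_hi, text_hi) in enumerate(scored):
        for reward_lo, text_lo in scored[i + 1:]:
            if reward_hi > reward_lo:  # skip ties: no preference signal
                pairs.append((text_hi, text_lo))
    return pairs
```

The first function corresponds to using human-rated scores directly as a reward; the second shows how the same ratings could be turned into preference data. Whether a given surveyed method weights aspects this way is not stated in the source.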
Referee: [§2 or Methods] The survey does not appear to state explicit inclusion/exclusion criteria or search strategy for selecting the cited works; this is load-bearing for the central claim of providing reliable practical guidance, as it leaves open the possibility of selection bias in a rapidly growing literature.

Authors: We acknowledge that an explicit search strategy and inclusion/exclusion criteria are not stated in the current version. To strengthen the claim of providing reliable guidance, we will add a dedicated paragraph in §2 detailing the literature search process (databases such as arXiv, ACL Anthology, and Google Scholar; keywords combining 'LLM'/'large language model' with 'peer review'/'review generation'; time frame 2022–2024; inclusion of both peer-reviewed and high-quality preprint works directly addressing LLM use in any stage of peer review). This addition will document our selection process and reduce concerns about bias.
revision: yes
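
As a rough sketch of how the keyword combination and time window described in this response could be made reproducible, the snippet below builds the boolean queries and applies an inclusion filter. The record layout (a dict with `title`, `abstract`, and `year` keys) and the function names are hypothetical; no real database API is called, and the substring matching is deliberately naive.

```python
from itertools import product

# Keyword groups and time window as listed in the rebuttal.
MODEL_TERMS = ["LLM", "large language model"]
TASK_TERMS = ["peer review", "review generation"]
YEAR_RANGE = (2022, 2024)


def build_queries(model_terms=MODEL_TERMS, task_terms=TASK_TERMS):
    """Return one boolean query string per (model term, task term) combination."""
    return [f'"{m}" AND "{t}"' for m, t in product(model_terms, task_terms)]


def is_included(record, year_range=YEAR_RANGE):
    """Apply the inclusion criteria to a hypothetical record dict."""
    text = f"{record['title']} {record['abstract']}".lower()
    in_window = year_range[0] <= record["year"] <= year_range[1]
    has_model = any(term.lower() in text for term in MODEL_TERMS)
    has_task = any(term.lower() in text for term in TASK_TERMS)
    return in_window and has_model and has_task
```

Logging the generated query strings alongside the filter would let readers rerun the search; the quality screen for preprints mentioned in the response would still need a manual step that this sketch does not cover.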

Circularity Check

0 steps flagged

No significant circularity: standard survey synthesis of external literature

This is a survey paper whose central contribution is cataloging and synthesizing existing techniques, datasets, and methods from the broader literature on LLM-assisted peer review. It contains no original mathematical derivations, equations, fitted parameters, predictions, or ansatzes. The abstract and structure explicitly frame the work as a synthesis of (i) peer review generation methods, (ii) after-review tasks, and (iii) evaluation approaches drawn from cited external sources. No load-bearing self-citations reduce the claims to unverified internal loops; any author self-citations (if present) would be incidental and not foundational to the survey's organization. The paper follows a conventional survey format with dedicated sections on limitations, ethics, and future directions, without renaming known results or smuggling assumptions via citation chains. The derivation chain is therefore self-contained against external benchmarks, with no reductions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As a survey paper the central contribution is synthesis rather than new theory or experiments. It rests on the domain assumption that peer review consists of identifiable stages that can be assisted by LLMs and that existing published methods form a coherent body worth cataloging. No free parameters or invented entities are introduced.

axioms (1)

domain assumption: Peer review is a multi-stage process involving reviews, rebuttals, meta-reviews, final decisions, and subsequent manuscript revisions.
Explicitly stated in the abstract as the foundation for organizing the surveyed techniques.

pith-pipeline@v0.9.0 · 5468 in / 1449 out tokens · 80967 ms · 2026-05-07T07:39:44.798210+00:00 · methodology
