On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Akari Asai; Aleksandar Shulevski; Alice Oh; Amanda Montoya; Arthur Porto; Biljana Mitreska; Biljana Mojsoska; Carolin Lawrence; Changwon Yoon; Christian Langkammer

arXiv: 2605.20668 · v1 · pith:KZGX6Y72 · submitted 2026-05-20 · cs.CL · cs.AI · cs.LG

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

Reviewed by Pith · T0 review · T1 audit · T2 compute · T3 formal · T4 kernel · 2026-05-21 05:19 UTC · grok-4.3 · pith:KZGX6Y72

classification: cs.CL, cs.AI, cs.LG

keywords: AI peer review, scientific publishing, large language models, expert annotation, review quality, Nature journals

The pith

An AI reviewer powered by GPT-5.2 outperforms the top human reviewer on a composite of correctness, significance, and evidence for Nature-family papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares AI-generated and human-written reviews of 82 Nature-family papers by having 45 domain scientists rate 2,960 individual criticisms on three dimensions. Experts spent 469 hours producing these ratings. Results show the best AI system beats the highest-rated human reviewer on the composite score while all tested AI systems beat the lowest-rated human across every dimension. AI reviews also raise some unique issues but share more overlap with each other and display recurring weaknesses absent in human reviews.

Core claim

Through expert annotation of criticisms from reviews of Nature-family papers, the work shows that an AI agent using GPT-5.2 achieves a 60.0 percent composite score on correctness, significance, and evidential sufficiency, exceeding the 48.2 percent of each paper's top human reviewer. All three AI reviewers surpass the lowest human reviewer on every dimension, their accurate criticisms tend to be rated more significant and well-evidenced, and they identify a distinct 26 percent of issues no human raises, yet AI reviewers overlap more than humans do and share 16 specific weaknesses.

What carries the argument

Expert ratings of individual review criticisms on correctness, significance, and sufficiency of evidence.
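
A minimal sketch of how item-level ratings like these might roll up into a per-reviewer composite. The page does not spell out how the composite is built, so this assumes it is the share of a reviewer's criticisms rated correct, significant, and sufficiently evidenced all at once. `RatedItem`, `composite_score`, `scores_by_reviewer`, and the sample ratings are hypothetical names and invented data, not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedItem:
    """One expert-rated criticism (hypothetical schema)."""
    reviewer: str
    correct: bool
    significant: bool
    evidence_sufficient: bool


def composite_score(items):
    """Share of items passing all three checks (assumed definition).

    Returns 0.0 for an empty list.
    """
    if not items:
        return 0.0
    passed = sum(
        1 for it in items
        if it.correct and it.significant and it.evidence_sufficient
    )
    return passed / len(items)


def scores_by_reviewer(items):
    """Group items by reviewer name and score each group."""
    grouped = defaultdict(list)
    for it in items:
        grouped[it.reviewer].append(it)
    return {name: composite_score(group) for name, group in grouped.items()}


# Invented ratings for a single paper, for illustration only.
ratings = [
    RatedItem("human_1", True, True, True),
    RatedItem("human_1", True, False, True),
    RatedItem("human_2", False, False, False),
    RatedItem("ai_1", True, True, True),
    RatedItem("ai_1", True, True, True),
    RatedItem("ai_1", True, True, False),
]
scores = scores_by_reviewer(ratings)
# The per-paper "top human" baseline is the best-scoring human reviewer.
top_human = max(v for k, v in scores.items() if k.startswith("human"))
```

Under this assumed definition, comparing an AI reviewer against `top_human` paper by paper gives the kind of paired comparison the core claim describes; a different aggregation rule would change the numbers.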

If this is right

AI reviewers can complement humans by surfacing issues that experts miss.
Peer review systems could integrate AI to increase coverage of paper aspects.
AI development should target the identified weaknesses such as limited subfield knowledge and long-context handling.
Current AI systems are positioned as assistants rather than replacements for human reviewers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid human-AI review workflows might combine unique strengths to improve overall quality.
Testing AI reviewers on live submissions rather than post-publication reviews would reveal practical performance.
Extending the evaluation to other journal families could show whether the pattern holds beyond high-profile outlets.

Load-bearing premise

The 45 domain scientists provide unbiased and reliable ratings of the criticisms without systematic inter-rater differences or selection effects from the chosen papers.

What would settle it

New experts independently rating the same 2,960 criticisms and producing no significant difference between the GPT-5.2 agent and the top human reviewer would falsify the outperformance result.
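
One way such a re-rating could be checked, sketched under stated assumptions. The page reports p = 0.009 without naming the test, so the paired sign-flip permutation test below is a stand-in rather than the authors' procedure. `paired_sign_flip_pvalue` and the scores are hypothetical; the exact enumeration shown only suits a handful of papers.

```python
from itertools import product


def paired_sign_flip_pvalue(ai_scores, human_scores):
    """Exact two-sided sign-flip test on paired per-paper differences.

    Under the null hypothesis that neither reviewer is better, each
    per-paper difference is equally likely to have either sign.
    """
    if len(ai_scores) != len(human_scores):
        raise ValueError("scores must be paired, one pair per paper")
    diffs = [a - h for a, h in zip(ai_scores, human_scores)]
    observed = abs(sum(diffs))
    total = 0
    at_least_as_extreme = 0
    # Enumerate every assignment of signs to the differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Invented composite scores (percent) for six papers.
ai = [62, 55, 70, 48, 66, 59]
top_human = [50, 57, 52, 45, 49, 51]
p_value = paired_sign_flip_pvalue(ai, top_human)
```

With 82 papers the 2^82 sign patterns cannot be enumerated, so a practical version would sample sign patterns at random instead. A p-value well above the chosen threshold on the new ratings would be the outcome this section describes.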

Figures

Figures reproduced from arXiv: 2605.20668 by Akari Asai, Aleksandar Shulevski, Alice Oh, Amanda Montoya, Arthur Porto, Biljana Mitreska, Biljana Mojsoska, Carolin Lawrence, Changwon Yoon, Christian Langkammer, Chungwoo Lee, Daniel R. Schrider, Dongkeun Yoon, Dragana Manasova, Drew Bridges, Edward Choi, Elly Knight, Ergun Simsek, Esther H. R. Tsai, Francesco Santini, Graham Neubig, Heera Moon, Henrik Christiansen, Huishan Li, Hyunjoo Jenny Lee, Hyun Uk Kim, Ian Wu, Ilija Dukovski, Ishraq Md Anjum, Jae Kyoung Kim, Jeongyoun Ahn, Jihye Park, Jinheon Baek, Junhan Kim, Juyoung Suk, Khushboo Shafi, Kiril Gashteovski, Kyeongha Kwon, Makoto Takamoto, Marko Shuntov, Mooseok Jang, Nikola Stikov, Niyazi Ulas Dinç, Pranjal Aggarwal, Ruoqi Liu, Sean Welleck, Seungone Kim, Spase Petkoski, Sunkyu Han, Viktor Zaverkin, Woo Youn Kim, Xiang Yue, Yehhyun Jo, Yeonseung Chung, Yeon Sik Jung, Yong Jeong, Yoosang Son, Young Min Sim.

**Figure 1.** Illustration of the motivation behind our expert annotation study. Given a human-written review and an AI-generated review based on the same academic paper, prior works used shallow heuristics such as score correlation and acceptance matching to determine the quality of the AI-generated review. However, producing similar scores or matching accept/reject recommendations doesn’t indicate that the AI-generate…

**Figure 2.** Two example review items written by the same human reviewer of a paper in the Physical Sciences. Each …

**Figure 3.** An example of a review item produced by an AI reviewer for the same paper as …

**Figure 4.** AI reviewers overlap with each other much more than humans do, while AI panels match most human targets but only about half of the specific criticisms. (Left) Distribution of cross-reviewer item pairs across the four similarity categories, for Human-Human, Human-AI, and AI-AI pair types. (Right) Fraction of one reviewer’s items covered by another at three progressively stricter similarity thresholds: at le…

**Figure 5.** Strengths and weaknesses of AI reviewers identified by domain experts. Distribution of 442 free-form comments on AI reviews across 16 weakness categories (left, n = 260) and 6 strength categories (right, n = 132). Dark bars are item-level comments; light bars are paper-level comments. Categories are sorted by total count.

**Figure 6.** Reviewer prompt (Part 1 of 3): task description, principles, and rules for constructing each review item.

**Figure 7.** Reviewer prompt (Part 2 of 3): required output format for each review item and the citation list, and the six Nature evaluation criteria ordered by priority.

**Figure 8.** Reviewer prompt (Part 3 of 3): task workflow, guidelines for reading the paper and its code, guidelines for retrieving literature, and additional tips.

**Figure 9.** The annotation sheet presented to each domain scientist.

**Figure 10.** Illustration of the motivation behind the similarity analysis. For each paper in our expert-annotation study, we obtain six reviews: three from human reviewers and three from AI reviewers (top panel). To quantify how similar any two reviews are (e.g., human–human, human–AI, or AI–AI), we compare every review item in one review against every review item in the other and classify each item pair into one of …

**Figure 11.** Meta-reviewer prompt (Part 1 of 4): role, paper context, and the three principles that establish the bar for …

**Figure 12.** Meta-reviewer prompt (Part 2 of 4): the per-item decision procedure, with Part A producing the agent’s …

**Figure 13.** Meta-reviewer prompt (Part 3 of 4): consistency constraint linking the predicted ten-class label to the …

**Figure 14.** Meta-reviewer prompt (Part 4 of 4): verification checklist before finishing, filesystem and access …

Original abstract

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AI reviewers beat each paper's top human reviewer on composite criticism scores for these Nature-family papers, but they overlap more with each other and show 16 recurring weaknesses that humans avoid.

AI reviewers come out ahead of the best human reviewer on a composite of correctness, significance, and evidence for individual criticisms in this study of 82 Nature-family papers. GPT-5.2 hits 60% while the top human is at 48%, with a p-value of 0.009. All AIs also beat the weakest human across the board. They raise some unique issues too.

The real contribution is the scale and granularity. Breaking reviews into 2,960 specific criticisms and getting 45 domain experts to rate them on three separate dimensions gives a much clearer picture than verdict-matching studies. The authors find that 26% of issues are surfaced only by AI, and they catalogue 16 weaknesses that AI shows repeatedly, such as weak subfield knowledge and trouble with long contexts across files. This helps move the discussion from 'can AI review?' to 'what exactly does it do well or poorly?' The work is grounded in actual high-stakes papers and expert time, which is better than synthetic tests. The numbers on overlap (AI reviews agree more with each other than humans do) and the positioning as complements feel practical.

That said, the abstract leaves some gaps that matter for the main claim. We don't get the annotation guidelines, how much the 45 raters overlapped on the same criticisms, or exactly how papers were picked from the Nature family. If rater leniency varies or if the sample favors papers where AI happens to do better, the 60 versus 48 gap could move. The composite score construction also isn't detailed here.

This is useful for journal editors weighing AI tools and for researchers working on better review agents. Readers who want data on LLM limitations in expert domains will find the weakness list valuable. It deserves peer review because the core design is straightforward and the sample is large enough to be informative, even if methods need tightening. I'd recommend sending it to referees, with a request to examine the rating process and paper selection for any hidden biases.

Referee Report

3 major / 2 minor

Summary. The paper describes a large-scale annotation study in which 45 domain scientists in the Physical, Biological, and Health Sciences spent 469 hours evaluating 2,960 individual criticisms extracted from human and AI-generated reviews of 82 Nature-family papers. Each criticism was rated on three dimensions: correctness, significance, and sufficiency of evidence. The results indicate that AI-powered reviewers, particularly one based on GPT-5.2, achieve higher composite scores than the best human reviewer for each paper (60.0% versus 48.2%, with p = 0.009). Additionally, all AI reviewers outperform the lowest-rated human reviewers across all dimensions, identify a distinct set of issues (26% unique), but demonstrate greater overlap among their reviews (21% vs. 3% for humans) and share 16 recurring weaknesses not observed in human reviews, such as limited subfield knowledge and difficulties with long context.

Significance. If the expert ratings prove reliable, the findings offer valuable insights into the capabilities and limitations of AI in peer review. The scale of the study, with hundreds of hours of expert input and thousands of ratings, strengthens the evidence that AI can serve as a complement to human reviewers by surfacing unique issues. This has implications for improving the efficiency and thoroughness of scientific peer review processes.

major comments (3)

[Methods] Methods: The annotation protocol for the 45 domain scientists, including how the 2,960 criticisms were selected, presented to raters, and any training or calibration procedures, is not described in sufficient detail. This is load-bearing for the central claim because the composite scores and the 60.0% vs 48.2% comparison (p=0.009) rest entirely on these ratings as ground truth.
[Results] Results: No inter-rater agreement statistics (such as Fleiss' kappa, percentage agreement, or overlap metrics across the 45 raters) are reported for the ratings on correctness, significance, and evidential sufficiency. Systematic differences in rater leniency or interpretation could directly alter which criticisms are scored highly and shift the per-paper top-human baseline.
[Methods] Methods: The selection criteria and process for the 82 Nature-family papers are not elaborated, including any stratification by field or controls for potential biases. This affects the generalizability of the finding that AI exceeds top humans on the composite metric.
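
For reference, the agreement statistic named in the second comment can be computed directly from a table of rating counts. The sketch below assumes every criticism was rated by the same number of annotators, which may not hold for this study; `fleiss_kappa` and the sample table are hypothetical and the counts are invented.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every row must sum to the same number of raters (at least 2).
    Returns NaN when all ratings fall in a single category, where the
    statistic is undefined.
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in shares)
    if math.isclose(expected, 1.0):
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Invented example: 5 criticisms, 2 raters each,
# columns = [Correct, Not Correct].
correctness_table = [[2, 0], [2, 0], [1, 1], [0, 2], [1, 1]]
kappa = fleiss_kappa(correctness_table)
```

Running the same function once per dimension (correctness, significance, evidential sufficiency) would give the three statistics the referee asks for. Values near 0 indicate agreement no better than chance, and values near 1 indicate near-perfect agreement.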

minor comments (2)

[Abstract] Abstract: Clarify whether 'GPT-5.2' refers to a specific deployed model or a hypothetical/future version, as this affects interpretation of the performance numbers.
[Results] Results: Consider adding a table breaking down the three individual dimensions (correctness, significance, sufficiency) for AI vs. human reviewers to support the composite score claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help us improve the clarity and rigor of our manuscript. We address each major comment below and have made revisions to incorporate additional details where appropriate.

Referee: [Methods] Methods: The annotation protocol for the 45 domain scientists, including how the 2,960 criticisms were selected, presented to raters, and any training or calibration procedures, is not described in sufficient detail. This is load-bearing for the central claim because the composite scores and the 60.0% vs 48.2% comparison (p=0.009) rest entirely on these ratings as ground truth.

Authors: We agree that a more detailed description of the annotation protocol is warranted to support the reliability of the ratings. In the revised manuscript, we will expand the Methods section with a step-by-step account of how the 2,960 criticisms were extracted from the reviews, the criteria used for their selection and presentation to raters, the structure of the annotation interface, and the training and calibration procedures provided to the 45 domain scientists. This will include examples of rating guidelines and interface screenshots to enhance reproducibility.

Revision: yes

Referee: [Results] Results: No inter-rater agreement statistics (such as Fleiss' kappa, percentage agreement, or overlap metrics across the 45 raters) are reported for the ratings on correctness, significance, and evidential sufficiency. Systematic differences in rater leniency or interpretation could directly alter which criticisms are scored highly and shift the per-paper top-human baseline.

Authors: We acknowledge the importance of reporting inter-rater agreement to assess rating reliability. We have calculated Fleiss' kappa and percentage agreement for each of the three rating dimensions across the 45 raters and will add these statistics, along with a brief discussion of their implications, to the revised Results section. This addition will help address concerns about potential systematic differences in rater leniency.

Revision: yes

Referee: [Methods] Methods: The selection criteria and process for the 82 Nature-family papers are not elaborated, including any stratification by field or controls for potential biases. This affects the generalizability of the finding that AI exceeds top humans on the composite metric.

Authors: We agree that explicit details on paper selection are needed for evaluating generalizability. In the revised Methods section, we will provide a full description of the selection criteria for the 82 Nature-family papers, including stratification by field (Physical, Biological, and Health Sciences), the sampling procedure, time period, and any measures taken to control for biases such as paper length or topic distribution.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims rest on a new empirical study in which 45 independent domain scientists provided 2,960 ratings of individual criticisms across correctness, significance, and evidential sufficiency for both human and AI reviews of 82 papers. These external expert judgments serve as ground truth for the composite score comparisons (e.g., 60.0% vs 48.2%), with no equations, fitted parameters, self-definitional constructs, or load-bearing self-citations that reduce the reported results to the authors' prior inputs by construction. The methodology is self-contained against the collected annotations and does not invoke uniqueness theorems or ansatzes from the authors' own previous work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical human-annotation study that treats expert scientist ratings as the evaluation standard. It introduces no new theoretical entities, fitted parameters, or ad-hoc axioms beyond standard assumptions of annotation studies.

axioms (1)

Domain assumption: Domain experts can reliably judge the correctness, significance, and evidential support of individual review criticisms. The entire comparative analysis depends on the 45 annotators' scores serving as valid ground truth.

pith-pipeline@v0.9.0 · 6153 in / 1382 out tokens · 69744 ms · 2026-05-21T05:19:53.377566+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rebuttals Move Peer-Review Scores, but Initial-Review Structure Bounds the Movement
cs.DL · 2026-06 · unverdicted · novelty 7.0

Rebuttals shift peer-review scores in ways largely bounded by initial-review structure, with LLM-derived exchange features providing only modest additional predictive power.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper

[1]

Any incorrect or unsupported criticism will undermine the credibility of your review

Your review must be factually correct: Your claims will be checked by domain experts. Any incorrect or unsupported criticism will undermine the credibility of your review. When uncertain, avoid speculation

work page
[2]

Do not focus on minor or cosmetic issues

Your review must consist of only significant issues: Only point out problems that meaningfully affect the paper’s validity, soundness, methodology, claims, or reproducibility. Do not focus on minor or cosmetic issues. If you think there are less than five significant issues, then you should output less than five items (even zero items are allowed if there...

work page
[3]

Specifically, mention the contextual background of what the authors attempted to do, and why that was not sufficient when comparing to common practices in the field

Your review must be concise and only criticize at most five major aspects with detailed evidence: Each criticism must be supported with detailed evidence. Specifically, mention the contextual background of what the authors attempted to do, and why that was not sufficient when comparing to common practices in the field. ### Rules for constructing each item

work page
[4]

Each item consists of exactly two components: a claim and evidence

work page
[5]

In the claim, you must clearly state: a

The claim is the criticism itself. In the claim, you must clearly state: a. What you are criticizing the paper for. b. On which evaluation criterion or criteria the criticism is based. c. Which component of the paper the criticism refers to

work page
[6]

You should quote: a

The evidence must directly support the claim. You should quote: a. Exact sentences from the main paper or supplementary materials. b. Exact code blocks or functions from the paper’s code. c. Exact sentences from papers in the literature (hyperlinked and cited)

work page
[7]

At the end of the review, include a citation list containing all literature references used in your evidence

work page
[8]

It must contain at most five items and a citation list

The review must not include an introduction, summary, or concluding remarks. It must contain at most five items and a citation list

work page
[9]

All output must be valid markdown

work page
[10]

You must separate each item with a blank line

work page
[11]

Limitations

Try to avoid using what the paper listed in the "Limitations" or "Future work" section as your claim unless it is a significant issue

work page
[12]

The items should be sorted by their importance

work page
[13]

Figure 6: Reviewer prompt (Part 1 of 3): task description, principles, and rules for constructing each review item

Use the format Item 1, Item 2, ..., with no fraction or denominator. Figure 6: Reviewer prompt (Part 1 of 3): task description, principles, and rules for constructing each review item. ### Required structure and format of each item Each item must be formatted exactly as follows: ## Item N: <short title summarizing the criticism> #### Claim * Main point of...

work page
[14]

<citation 1> (hyperlinked to the retrieved literature)

work page
[15]

<citation 2> (hyperlinked to the retrieved literature)

work page
[16]

### Evaluation criteria (ordered by importance)

<citation 3> (hyperlinked to the retrieved literature) There should be at least five citations in the citation list. ### Evaluation criteria (ordered by importance)

work page
[17]

Validity: Does the manuscript have significant flaws which should prohibit its publication?

work page
[18]

Conclusions: Are the conclusions and data interpretation robust, valid and reliable?

work page
[19]

Originality and significance: Are the results presented of immediate interest to many people in the field of study, and/or to people from several disciplines?

work page
[20]

Data and methodology: Is the reporting of data and methodology sufficiently detailed and transparent to enable reproducing the results?

work page
[21]

Appropriate use of statistics and treatment of uncertainties: Are all error bars defined in the corresponding figure legends and are all statistical tests appropriate and the description of any error bars and probability values accurate?

work page
[22]

Figure 7: Reviewer prompt (Part 2 of 3): required output format for each review item and the citation list, and the six Nature evaluation criteria ordered by priority

Clarity and context: Is the abstract clear, accessible? Are abstract, introduction and conclusions appropriate? Note that earlier evaluation criteria should be prioritized over later ones when deciding the items in the review. Figure 7: Reviewer prompt (Part 2 of 3): required output format for each review item and the citation list, and the six Nature eva...

work page
[23]

Check it before trying to run the code

The code may include a README file that explains the purpose of the code and how to run it. Check it before trying to run the code

work page
[24]

If the code is not executable, try to resolve dependencies, download the necessary datasets, and run the code to validate your claims

work page
[25]

### Guidelines for retrieving literature

Do not try to run the code if it is non-executable or resource-prohibitive. ### Guidelines for retrieving literature

work page
[26]

Determine which papers are most relevant

Do not iterate through all the papers included in the paper’s references. Determine which papers are most relevant

work page
[27]

Be proactive and add search queries during the review process

work page
[28]

It is recommended not only to retrieve academic papers, but also blog posts, news articles, datasets, and code repositories

work page
[29]

### Tips

Ensure you actually read what you retrieved. ### Tips

work page
[30]

Do not assume the paper is incorrect solely because of OCR mistakes

The paper’s markdown may contain OCR errors. Do not assume the paper is incorrect solely because of OCR mistakes. Do not point out that the manuscript is incomplete due to formatting issues

work page
[31]

Do not point out broken or missing figure assets

Image filenames are guaranteed to be figure1.png, figure2.png, etc. Do not point out broken or missing figure assets

work page
[32]

The code you are reviewing does not need to be perfect; focus on major issues such as non-reproducible experiments or mismatches with descriptions rather than minor issues

work page
[33]

Which human reviewer do you think provided the best quality review overall?

When refining your review, ensure that all items are factually correct, significant, and mutually exclusive. Figure 8: Reviewer prompt (Part 3 of 3): task workflow, guidelines for reading the paper and its code, guidelines for retrieving literature, and additional tips. Figure 9: The annotation sheet presented to each domain scientist.(Left) item-level an...

work page 2025
[34]

the authors should discuss this limitation

The main added value is the specific Ali et al. (2025) reference demonstrating synergistic effects with quantified RRs, which provides stronger empirical backing than the first reviewer offered for the same point. P60 · Claude 4.5 · item 2 ·secondary This is essentially the same critique as the AI reviewer 1 Item 2, which already identified that removing ...

work page 2025
[35]

Correct + Sig. + Evi. Sufficient 277 (30.5%)

work page
[36]

Correct + Sig. + Evi. Not Suff. 2 (0.2%)

work page
[37]

Correct + Sig. + Evi. Disagree 20 (2.2%)

work page
[38]

Correct + Marg. Sig. + Evi. Sufficient 74 (8.1%)

work page
[39]

Correct + Marg. Sig. + Evi. Not Suff. 4 (0.4%)

work page
[40]

Correct + Marg. Sig. + Evi. Disagree 13 (1.4%)

work page
[41]

Correct + Not Significant 55 (6.1%)

work page
[42]

Disagree 298 (32.8%)

Correct + Sig. Disagree 298 (32.8%)

work page
[43]

Not Correct 36 (4.0%)

work page
[44]

Sig.” = both significant; “marg

Disagree on Correctness 129 (14.2%) Table 46:Calibration set statistics.Each of 908 review items from 27 dual-annotated papers carries a 10-class ground truth label encoding both the cascade outcome (correctness → significance → evidence) and inter-annotator agreement. “Sig.” = both significant; “marg.” = both marginally significant. Theprimary settingis ...

work page 2024
[45]

Judge the item along three axes: correctness, significance, and evidence sufficiency (your own meta-review judgment)

work page
[46]

You are NOT writing a new review


[1] Your review must be factually correct: Your claims will be checked by domain experts. Any incorrect or unsupported criticism will undermine the credibility of your review. When uncertain, avoid speculation.

[2] Your review must consist of only significant issues: Only point out problems that meaningfully affect the paper’s validity, soundness, methodology, claims, or reproducibility. Do not focus on minor or cosmetic issues. If you think there are less than five significant issues, then you should output less than five items (even zero items are allowed if there...

[3] Your review must be concise and only criticize at most five major aspects with detailed evidence: Each criticism must be supported with detailed evidence. Specifically, mention the contextual background of what the authors attempted to do, and why that was not sufficient when comparing to common practices in the field.

### Rules for constructing each item

[4] Each item consists of exactly two components: a claim and evidence.

[5] The claim is the criticism itself. In the claim, you must clearly state: a. What you are criticizing the paper for. b. On which evaluation criterion or criteria the criticism is based. c. Which component of the paper the criticism refers to.

[6] The evidence must directly support the claim. You should quote: a. Exact sentences from the main paper or supplementary materials. b. Exact code blocks or functions from the paper’s code. c. Exact sentences from papers in the literature (hyperlinked and cited).

[7] At the end of the review, include a citation list containing all literature references used in your evidence.

[8] The review must not include an introduction, summary, or concluding remarks. It must contain at most five items and a citation list.

[9] All output must be valid markdown.

[10] You must separate each item with a blank line.

[11] Try to avoid using what the paper listed in the "Limitations" or "Future work" section as your claim unless it is a significant issue.

[12] The items should be sorted by their importance.

[13] Use the format Item 1, Item 2, ..., with no fraction or denominator.

Figure 6: Reviewer prompt (Part 1 of 3): task description, principles, and rules for constructing each review item.

### Required structure and format of each item

Each item must be formatted exactly as follows: ## Item N: <short title summarizing the criticism> #### Claim * Main point of...
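The item rules above (at most five items, headings of the form `## Item N: <short title>` with no fraction or denominator, and a `#### Claim` block in each item) can be checked mechanically. The following is a minimal sketch, not the authors' tooling: the function name `check_review_items` and the exact messages are hypothetical, and only the heading and `#### Claim` markers visible in the prompt excerpt are checked, since the rest of the template is truncated in the source.

```python
import re

MAX_ITEMS = 5  # "at most five items" per the reviewer prompt

# Matches headings such as "## Item 3: Missing ablation"; the label is
# captured loosely so that forms like "1/5" can be reported as violations.
ITEM_HEADING = re.compile(r"^## Item (\S+?):[ \t]*(.*)$", re.MULTILINE)


def check_review_items(markdown: str) -> list[str]:
    """Return a list of rule violations found in a review's item sections."""
    problems = []
    headings = ITEM_HEADING.findall(markdown)
    if len(headings) > MAX_ITEMS:
        problems.append(f"{len(headings)} items exceeds the limit of {MAX_ITEMS}")
    for position, (label, title) in enumerate(headings, start=1):
        if not label.isdigit():
            problems.append(f"item label {label!r} is not a plain integer")
        elif int(label) != position:
            problems.append(f"item {label} appears at position {position}")
        if not title.strip():
            problems.append(f"item {label} has no short title")
    # Text following each item heading, up to the next item heading.
    bodies = re.split(r"^## Item .*$", markdown, flags=re.MULTILINE)[1:]
    for position, body in enumerate(bodies, start=1):
        if "#### Claim" not in body:
            problems.append(f"item at position {position} lacks a '#### Claim' block")
    return problems
```

An empty return value means the review passes these structural checks; it says nothing about whether the claims themselves are correct or significant.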

[14] <citation 1> (hyperlinked to the retrieved literature)

[15] <citation 2> (hyperlinked to the retrieved literature)

[16] <citation 3> (hyperlinked to the retrieved literature) There should be at least five citations in the citation list.

### Evaluation criteria (ordered by importance)

[17] Validity: Does the manuscript have significant flaws which should prohibit its publication?

[18] Conclusions: Are the conclusions and data interpretation robust, valid and reliable?

[19] Originality and significance: Are the results presented of immediate interest to many people in the field of study, and/or to people from several disciplines?

[20] Data and methodology: Is the reporting of data and methodology sufficiently detailed and transparent to enable reproducing the results?

[21] Appropriate use of statistics and treatment of uncertainties: Are all error bars defined in the corresponding figure legends and are all statistical tests appropriate and the description of any error bars and probability values accurate?

[22] Clarity and context: Is the abstract clear, accessible? Are abstract, introduction and conclusions appropriate? Note that earlier evaluation criteria should be prioritized over later ones when deciding the items in the review.

Figure 7: Reviewer prompt (Part 2 of 3): required output format for each review item and the citation list, and the six Nature evaluation criteria ordered by priority. ...

[23] The code may include a README file that explains the purpose of the code and how to run it. Check it before trying to run the code.

[24] If the code is not executable, try to resolve dependencies, download the necessary datasets, and run the code to validate your claims.

[25] Do not try to run the code if it is non-executable or resource-prohibitive.

### Guidelines for retrieving literature

[26] Do not iterate through all the papers included in the paper’s references. Determine which papers are most relevant.

[27] Be proactive and add search queries during the review process.

[28] It is recommended not only to retrieve academic papers, but also blog posts, news articles, datasets, and code repositories.

[29] Ensure you actually read what you retrieved.

### Tips

[30] The paper’s markdown may contain OCR errors. Do not assume the paper is incorrect solely because of OCR mistakes. Do not point out that the manuscript is incomplete due to formatting issues.

[31] Image filenames are guaranteed to be figure1.png, figure2.png, etc. Do not point out broken or missing figure assets.

[32] The code you are reviewing does not need to be perfect; focus on major issues such as non-reproducible experiments or mismatches with descriptions rather than minor issues.

[33] When refining your review, ensure that all items are factually correct, significant, and mutually exclusive.

Figure 8: Reviewer prompt (Part 3 of 3): task workflow, guidelines for reading the paper and its code, guidelines for retrieving literature, and additional tips.

Figure 9: The annotation sheet presented to each domain scientist. (Left) item-level an...

Which human reviewer do you think provided the best quality review overall?

[34] The main added value is the specific Ali et al. (2025) reference demonstrating synergistic effects with quantified RRs, which provides stronger empirical backing than the first reviewer offered for the same point.

P60 · Claude 4.5 · item 2 · secondary

This is essentially the same critique as the AI reviewer 1 Item 2, which already identified that removing ...

the authors should discuss this limitation

| Class | Items | Share |
| --- | --- | --- |
| Correct + Sig. + Evi. Sufficient | 277 | 30.5% |
| Correct + Sig. + Evi. Not Suff. | 2 | 0.2% |
| Correct + Sig. + Evi. Disagree | 20 | 2.2% |
| Correct + Marg. Sig. + Evi. Sufficient | 74 | 8.1% |
| Correct + Marg. Sig. + Evi. Not Suff. | 4 | 0.4% |
| Correct + Marg. Sig. + Evi. Disagree | 13 | 1.4% |
| Correct + Not Significant | 55 | 6.1% |
| Correct + Sig. Disagree | 298 | 32.8% |
| Not Correct | 36 | 4.0% |
| Disagree on Correctness | 129 | 14.2% |

Table 46: Calibration set statistics. Each of 908 review items from 27 dual-annotated papers carries a 10-class ground truth label encoding both the cascade outcome (correctness → significance → evidence) and inter-annotator agreement. “Sig.” = both significant; “marg.” = both marginally significant. The primary setting is ...
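One way to read the 10-class label in Table 46 is as a cascade applied to two experts' judgments: agreement on correctness is checked first, then significance, then evidence. The sketch below illustrates that reading and re-derives the reported shares from the counts. The `Judgment` type, its field encodings, and the rule that any mismatch in significance level counts as "Sig. Disagree" are assumptions made for illustration; the source excerpt does not spell out the collapsing procedure.

```python
from collections import namedtuple

# One expert's judgment of a review item. Field encodings are assumed:
# correct is a bool, significance is "significant", "marginal" or "not",
# evidence_sufficient is a bool.
Judgment = namedtuple("Judgment", ["correct", "significance", "evidence_sufficient"])

# Counts per class as reported in Table 46.
COUNTS = {
    "Correct + Sig. + Evi. Sufficient": 277,
    "Correct + Sig. + Evi. Not Suff.": 2,
    "Correct + Sig. + Evi. Disagree": 20,
    "Correct + Marg. Sig. + Evi. Sufficient": 74,
    "Correct + Marg. Sig. + Evi. Not Suff.": 4,
    "Correct + Marg. Sig. + Evi. Disagree": 13,
    "Correct + Not Significant": 55,
    "Correct + Sig. Disagree": 298,
    "Not Correct": 36,
    "Disagree on Correctness": 129,
}


def collapse(a: Judgment, b: Judgment) -> str:
    """Collapse two judgments into one class label (assumed cascade order)."""
    if a.correct != b.correct:
        return "Disagree on Correctness"
    if not a.correct:
        return "Not Correct"
    if a.significance != b.significance:  # assumption: any mismatch is a disagreement
        return "Correct + Sig. Disagree"
    if a.significance == "not":
        return "Correct + Not Significant"
    prefix = "Correct + Sig." if a.significance == "significant" else "Correct + Marg. Sig."
    if a.evidence_sufficient != b.evidence_sufficient:
        return prefix + " + Evi. Disagree"
    suffix = " + Evi. Sufficient" if a.evidence_sufficient else " + Evi. Not Suff."
    return prefix + suffix


def shares(counts: dict) -> dict:
    """Percentage share of each class, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

The counts sum to the 908 items stated in the caption, and `shares(COUNTS)` gives a quick consistency check against the percentages in the table.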

[45] Judge the item along three axes: correctness, significance, and evidence sufficiency (your own meta-review judgment).

[46] Predict how two independent expert meta-reviewers would jointly judge the item, expressed as one of 10 collapsed class labels that encode both the cascade outcome and inter-expert agreement. You are NOT writing a new review. You are judging existing review items by verifying their claims against the paper.

### Paper location

The paper’s source files are a...

[47] Correctness -- judge the main point, not peripheral details. If the reviewer’s core concern is valid even though one specific supporting claim is inaccurate, the item is still Correct. Only mark Not Correct when the main point itself is wrong.

[48] Significance -- the bar is "would this improve the paper?" Any criticism that would genuinely help the paper if addressed is Significant -- it does NOT need to threaten the paper’s validity. Missing statistics, undefined figure annotations, unreported methodological details, internal inconsistencies between text and figures, and missing ablations are typi...

[49] Evidence -- verifiability, not exhaustiveness. If a meta-reviewer can verify the reviewer’s claim from what the reviewer wrote plus the paper, the evidence is Sufficient. When the reviewer’s criticism IS that something is absent, identifying the specific absence IS the evidence. Reserve Requires More for cases where the meta-reviewer cannot even locate wh...

Correct". Continue to Step 3. - Main point itself factually wrong? ->

[50] Read the file back and verify it is valid JSON (no syntax errors, no trailing commas, no truncated content).

[51] Count reviewers: number of reviewer entries must match number of .md files in the reviews/ directory.

[52] Count items per reviewer: number of item entries must match the number of "## Item" sections in that reviewer’s .md file.

[53] Check label strings: every correctness value must be exactly "Correct" or "Not Correct"; every prediction must be one of the 10 valid strings (listed below).

[54] Check consistency: prediction must agree with axis labels.

[55] If any check fails, fix the file and re-verify. Only after all checks pass, print: "Verification complete. All [N] reviewers and [M] total items included. Prediction written to {output_file}" Then stop.

### Filesystem boundaries

- READ from {paper_preprint_dir} and {paper_reviews_dir}. These are the paper’s source files. Do not modify anything there.
- WR...
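The verification checklist in items [50] to [55] (valid JSON, reviewer count, per-reviewer item count, label strings, consistency) could be automated along the following lines. This is a hedged sketch rather than the paper's implementation: the JSON layout (a top-level `"reviewers"` list whose entries carry `"reviewer"`, `"items"`, `"correctness"` and `"prediction"` keys) is hypothetical, the set of 10 valid prediction strings is passed in because it is not listed in this excerpt, and the consistency rule shown is only one plausible instance of "prediction must agree with axis labels".

```python
import json
import re
from pathlib import Path

VALID_CORRECTNESS = {"Correct", "Not Correct"}


def verify_prediction(output_file, reviews_dir, valid_predictions):
    """Return a list of failed checks; an empty list means every check passed."""
    try:
        data = json.loads(Path(output_file).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    failures = []
    review_files = {p.stem: p for p in Path(reviews_dir).glob("*.md")}
    reviewers = data.get("reviewers", [])  # hypothetical key name
    if len(reviewers) != len(review_files):
        failures.append(
            f"{len(reviewers)} reviewer entries but {len(review_files)} .md files"
        )

    for entry in reviewers:
        name = entry.get("reviewer")  # hypothetical: matches the .md file stem
        items = entry.get("items", [])
        path = review_files.get(name)
        if path is None:
            failures.append(f"no review file for reviewer {name!r}")
        else:
            text = path.read_text(encoding="utf-8")
            expected = len(re.findall(r"^## Item", text, flags=re.MULTILINE))
            if len(items) != expected:
                failures.append(f"{name}: {len(items)} items, expected {expected}")
        for index, item in enumerate(items, start=1):
            correctness = item.get("correctness")
            prediction = item.get("prediction")
            if correctness not in VALID_CORRECTNESS:
                failures.append(f"{name} item {index}: bad correctness {correctness!r}")
            if prediction not in valid_predictions:
                failures.append(f"{name} item {index}: bad prediction {prediction!r}")
            # Assumed consistency rule: an item judged "Not Correct" should not
            # receive a prediction that starts with "Correct +".
            elif correctness == "Not Correct" and prediction.startswith("Correct +"):
                failures.append(f"{name} item {index}: prediction contradicts correctness")
    return failures
```

Returning a list of failures rather than raising on the first one mirrors the instruction to fix the file and re-verify until all checks pass.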

[56] The paper’s markdown may contain OCR errors. Do not penalize reviewers for pointing out things that are actually OCR artifacts; infer content from context.

[57] Image links may be broken. Figures are stored at preprint/images/figure1.png, etc.; open images_list.json to see captions.

[58] Do not try to read every file in code/ -- focus on the files that reviewers explicitly reference.

[59] Apply the same significance bar consistently across all reviewers. Do not be lenient on one reviewer and strict on another.

[60] Your judgment must be independent of who wrote the review. Do not infer reviewer identity (human/AI) from writing style.

Figure 14: Meta-reviewer prompt (Part 4 of 4): verification checklist before finishing, filesystem and access boundaries (including the domain blocklist that prevents the agent from retrieving the published version of the paper), the te...