arxiv: 2605.20668 · v1 · pith:KZGX6Y72new · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.LG

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Pith reviewed 2026-05-21 05:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords AI peer review · scientific publishing · large language models · expert annotation · review quality · Nature journals

The pith

An AI reviewer powered by GPT-5.2 outperforms the top human reviewer on a composite of correctness, significance, and evidence for Nature-family papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares AI-generated and human-written reviews of 82 Nature-family papers by having 45 domain scientists rate 2,960 individual criticisms on three dimensions: correctness, significance, and sufficiency of evidence. The experts spent 469 hours producing these ratings. The best AI system beats each paper's top-rated human reviewer on the composite score, and all three tested AI systems beat the lowest-rated human on every dimension. AI reviews also surface issues no human raises, but they overlap with each other far more than human reviews do (21% vs. 3% of cross-reviewer pairs) and display recurring weaknesses absent from human reviews.

Core claim

Through expert annotation of criticisms from reviews of Nature-family papers, the work shows that an AI agent using GPT-5.2 achieves a 60.0 percent composite score on correctness, significance, and evidential sufficiency, exceeding the 48.2 percent of each paper's top human reviewer (p = 0.009). All three AI reviewers surpass the lowest-rated human reviewer on every dimension, and their accurate criticisms are more often rated significant and well-evidenced. They also surface a distinct 26 percent of issues that no human raises. However, AI reviewers overlap with one another more than humans do and share 16 recurring weaknesses that human reviewers do not.

What carries the argument

Expert ratings of individual review criticisms on correctness, significance, and sufficiency of evidence.
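
To make the composite concrete, here is a minimal sketch of one plausible way to compute it. This summary does not give the paper's exact formula, so the sketch assumes the composite is the share of a reviewer's criticisms that an expert rated correct, significant, and sufficiently evidenced all at once. The `ItemRating` schema, the function name, and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemRating:
    """One expert rating of one review criticism (hypothetical schema)."""
    correct: bool
    significant: bool
    evidence_sufficient: bool


def composite_score(ratings: list[ItemRating]) -> float:
    """Share of items that pass all three dimensions (assumed definition)."""
    if not ratings:
        return 0.0
    passed = sum(
        r.correct and r.significant and r.evidence_sufficient for r in ratings
    )
    return passed / len(ratings)


# Toy ratings for a single reviewer; these are not data from the paper.
reviewer_items = [
    ItemRating(True, True, True),
    ItemRating(True, True, False),
    ItemRating(True, False, False),
    ItemRating(False, False, False),
    ItemRating(True, True, True),
]
print(f"{composite_score(reviewer_items):.1%}")  # 40.0%
```

Under this assumed definition a criticism that is correct but weakly evidenced counts the same as an incorrect one, which is why the composite can sit well below the correctness rate alone.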

If this is right

  • AI reviewers can complement humans by surfacing issues that experts miss.
  • Peer review systems could integrate AI to increase coverage of paper aspects.
  • AI development should target the identified weaknesses such as limited subfield knowledge and long-context handling.
  • Current AI systems are positioned as assistants rather than replacements for human reviewers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid human-AI review workflows might combine unique strengths to improve overall quality.
  • Testing AI reviewers on live submissions rather than post-publication reviews would reveal practical performance.
  • Extending the evaluation to other journal families could show whether the pattern holds beyond high-profile outlets.

Load-bearing premise

The 45 domain scientists provide unbiased and reliable ratings of the criticisms without systematic inter-rater differences or selection effects from the chosen papers.

What would settle it

New experts independently rating the same 2,960 criticisms and producing no significant difference between the GPT-5.2 agent and the top human reviewer would falsify the outperformance result.
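
A replication of this kind needs a paired test over per-paper scores. The summary does not say which test produced the reported p = 0.009, so the sketch below uses a two-sided sign-flip permutation test as one reasonable stand-in, not as the authors' method. The function name and all scores are hypothetical.

```python
import random


def paired_permutation_pvalue(
    a: list[float], b: list[float], n_resamples: int = 10_000, seed: int = 0
) -> float:
    """Two-sided sign-flip permutation test on the mean paired difference."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paper's AI/human labels are exchangeable,
        # so the sign of each paired difference is flipped at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy per-paper composite scores; these are not data from the paper.
ai_scores = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.64, 0.52]
top_human_scores = [0.50, 0.47, 0.60, 0.51, 0.45, 0.49, 0.58, 0.40]
print(paired_permutation_pvalue(ai_scores, top_human_scores))
```

Pairing by paper matters here because the "top human" baseline is chosen per paper, so the two score lists are not independent samples.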

Figures

Figures reproduced from arXiv: 2605.20668 by Akari Asai, Aleksandar Shulevski, Alice Oh, Amanda Montoya, Arthur Porto, Biljana Mitreska, Biljana Mojsoska, Carolin Lawrence, Changwon Yoon, Christian Langkammer, Chungwoo Lee, Daniel R. Schrider, Dongkeun Yoon, Dragana Manasova, Drew Bridges, Edward Choi, Elly Knight, Ergun Simsek, Esther H. R. Tsai, Francesco Santini, Graham Neubig, Heera Moon, Henrik Christiansen, Huishan Li, Hyunjoo Jenny Lee, Hyun Uk Kim, Ian Wu, Ilija Dukovski, Ishraq Md Anjum, Jae Kyoung Kim, Jeongyoun Ahn, Jihye Park, Jinheon Baek, Junhan Kim, Juyoung Suk, Khushboo Shafi, Kiril Gashteovski, Kyeongha Kwon, Makoto Takamoto, Marko Shuntov, Mooseok Jang, Nikola Stikov, Niyazi Ulas Dinç, Pranjal Aggarwal, Ruoqi Liu, Sean Welleck, Seungone Kim, Spase Petkoski, Sunkyu Han, Viktor Zaverkin, Woo Youn Kim, Xiang Yue, Yehhyun Jo, Yeonseung Chung, Yeon Sik Jung, Yong Jeong, Yoosang Son, Young Min Sim.

Figure 1: Illustration of the motivation behind our expert annotation study. Given a human-written review and an AI-generated review based on the same academic paper, prior works used shallow heuristics such as score correlation and acceptance matching to determine the quality of the AI-generated review. However, producing similar scores or matching accept/reject recommendations doesn’t indicate that the AI-generate…

Figure 2: Two example review items written by the same human reviewer of a paper in the Physical Sciences. Each …

Figure 3: An example of a review item produced by an AI reviewer for the same paper as …

Figure 4: AI reviewers overlap with each other much more than humans do, while AI panels match most human targets but only about half of the specific criticisms. (Left) Distribution of cross-reviewer item pairs across the four similarity categories, for Human-Human, Human-AI, and AI-AI pair types. (Right) Fraction of one reviewer’s items covered by another at three progressively stricter similarity thresholds: at le…

Figure 5: Strengths and weaknesses of AI reviewers identified by domain experts. Distribution of 442 free-form comments on AI reviews across 16 weakness categories (left, n = 260) and 6 strength categories (right, n = 132). Dark bars are item-level comments; light bars are paper-level comments. Categories are sorted by total count.

Figure 6: Reviewer prompt (Part 1 of 3): task description, principles, and rules for constructing each review item.

Figure 7: Reviewer prompt (Part 2 of 3): required output format for each review item and the citation list, and the …

Figure 8: Reviewer prompt (Part 3 of 3): task workflow, guidelines for reading the paper and its code, guidelines for …

Figure 9: The annotation sheet presented to each domain scientist.

Figure 10: Illustration of the motivation behind the similarity analysis. For each paper in our expert-annotation study, we obtain six reviews: three from human reviewers and three from AI reviewers (top panel). To quantify how similar any two reviews are (e.g., human–human, human–AI, or AI–AI), we compare every review item in one review against every review item in the other and classify each item pair into one of …

Figure 11: Meta-reviewer prompt (Part 1 of 4): role, paper context, and the three principles that establish the bar for …

Figure 12: Meta-reviewer prompt (Part 2 of 4): the per-item decision procedure, with Part A producing the agent’s …

Figure 13: Meta-reviewer prompt (Part 3 of 4): consistency constraint linking the predicted ten-class label to the …

Figure 14: Meta-reviewer prompt (Part 4 of 4): verification checklist before finishing, filesystem and access …

Original abstract

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper describes a large-scale annotation study in which 45 domain scientists in the Physical, Biological, and Health Sciences spent 469 hours evaluating 2,960 individual criticisms extracted from human and AI-generated reviews of 82 Nature-family papers. Each criticism was rated on three dimensions: correctness, significance, and sufficiency of evidence. The results indicate that AI-powered reviewers, particularly one based on GPT-5.2, achieve higher composite scores than the best human reviewer for each paper (60.0% versus 48.2%, with p = 0.009). Additionally, all AI reviewers outperform the lowest-rated human reviewers across all dimensions, identify a distinct set of issues (26% unique), but demonstrate greater overlap among their reviews (21% vs. 3% for humans) and share 16 recurring weaknesses not observed in human reviews, such as limited subfield knowledge and difficulties with long context.

Significance. If the expert ratings prove reliable, the findings offer valuable insights into the capabilities and limitations of AI in peer review. The scale of the study, with hundreds of hours of expert input and thousands of ratings, strengthens the evidence that AI can serve as a complement to human reviewers by surfacing unique issues. This has implications for improving the efficiency and thoroughness of scientific peer review processes.

major comments (3)
  1. [Methods] Methods: The annotation protocol for the 45 domain scientists, including how the 2,960 criticisms were selected, presented to raters, and any training or calibration procedures, is not described in sufficient detail. This is load-bearing for the central claim because the composite scores and the 60.0% vs 48.2% comparison (p=0.009) rest entirely on these ratings as ground truth.
  2. [Results] Results: No inter-rater agreement statistics (such as Fleiss' kappa, percentage agreement, or overlap metrics across the 45 raters) are reported for the ratings on correctness, significance, and evidential sufficiency. Systematic differences in rater leniency or interpretation could directly alter which criticisms are scored highly and shift the per-paper top-human baseline.
  3. [Methods] Methods: The selection criteria and process for the 82 Nature-family papers are not elaborated, including any stratification by field or controls for potential biases. This affects the generalizability of the finding that AI exceeds top humans on the composite metric.
minor comments (2)
  1. [Abstract] Abstract: Clarify whether 'GPT-5.2' refers to a specific deployed model or a hypothetical/future version, as this affects interpretation of the performance numbers.
  2. [Results] Results: Consider adding a table breaking down the three individual dimensions (correctness, significance, sufficiency) for AI vs. human reviewers to support the composite score claims.
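
The agreement statistic requested in major comment 2 can be sketched as follows. This is a minimal pure-Python Fleiss' kappa, assuming every criticism was rated by the same number of experts on a categorical scale; the summary does not state how many raters saw each item, and the function name and toy counts are hypothetical.

```python
import math


def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] = raters who put item i in category j."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of raters")
    n_cats = len(counts[0])
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the pooled category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    if math.isclose(p_e, 1.0):
        return math.nan  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data: 6 criticisms, 2 raters each, columns = [Correct, Not Correct].
# These counts are illustrative and not taken from the paper.
toy_counts = [[2, 0], [2, 0], [1, 1], [0, 2], [2, 0], [0, 2]]
print(round(fleiss_kappa(toy_counts), 3))
```

Reporting kappa separately for correctness, significance, and evidence would show which dimension drives any disagreement, since the composite depends on all three.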

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help us improve the clarity and rigor of our manuscript. We address each major comment below and have made revisions to incorporate additional details where appropriate.

  1. Referee: [Methods] Methods: The annotation protocol for the 45 domain scientists, including how the 2,960 criticisms were selected, presented to raters, and any training or calibration procedures, is not described in sufficient detail. This is load-bearing for the central claim because the composite scores and the 60.0% vs 48.2% comparison (p=0.009) rest entirely on these ratings as ground truth.

    Authors: We agree that a more detailed description of the annotation protocol is warranted to support the reliability of the ratings. In the revised manuscript, we will expand the Methods section with a step-by-step account of how the 2,960 criticisms were extracted from the reviews, the criteria used for their selection and presentation to raters, the structure of the annotation interface, and the training and calibration procedures provided to the 45 domain scientists. This will include examples of rating guidelines and interface screenshots to enhance reproducibility. revision: yes

  2. Referee: [Results] Results: No inter-rater agreement statistics (such as Fleiss' kappa, percentage agreement, or overlap metrics across the 45 raters) are reported for the ratings on correctness, significance, and evidential sufficiency. Systematic differences in rater leniency or interpretation could directly alter which criticisms are scored highly and shift the per-paper top-human baseline.

    Authors: We acknowledge the importance of reporting inter-rater agreement to assess rating reliability. We have calculated Fleiss' kappa and percentage agreement for each of the three rating dimensions across the 45 raters and will add these statistics, along with a brief discussion of their implications, to the revised Results section. This addition will help address concerns about potential systematic differences in rater leniency. revision: yes

  3. Referee: [Methods] Methods: The selection criteria and process for the 82 Nature-family papers are not elaborated, including any stratification by field or controls for potential biases. This affects the generalizability of the finding that AI exceeds top humans on the composite metric.

    Authors: We agree that explicit details on paper selection are needed for evaluating generalizability. In the revised Methods section, we will provide a full description of the selection criteria for the 82 Nature-family papers, including stratification by field (Physical, Biological, and Health Sciences), the sampling procedure, time period, and any measures taken to control for biases such as paper length or topic distribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims rest on a new empirical study in which 45 independent domain scientists provided 2,960 ratings of individual criticisms across correctness, significance, and evidential sufficiency for both human and AI reviews of 82 papers. These external expert judgments serve as ground truth for the composite score comparisons (e.g., 60.0% vs 48.2%), with no equations, fitted parameters, self-definitional constructs, or load-bearing self-citations that reduce the reported results to the authors' prior inputs by construction. The methodology is self-contained against the collected annotations and does not invoke uniqueness theorems or ansatzes from the authors' own previous work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical human-annotation study that treats expert scientist ratings as the evaluation standard. It introduces no new theoretical entities, fitted parameters, or ad-hoc axioms beyond standard assumptions of annotation studies.

axioms (1)
  • Domain assumption: domain experts can reliably judge the correctness, significance, and evidential support of individual review criticisms.
    The entire comparative analysis depends on the 45 annotators' scores serving as valid ground truth.

pith-pipeline@v0.9.0 · 6153 in / 1382 out tokens · 69744 ms · 2026-05-21T05:19:53.377566+00:00 · methodology

