arxiv: 2605.16623 · v1 · pith:23UZLL5Xnew · submitted 2026-05-15 · 💻 cs.CY · cs.AI

To Trust or Not to Trust: Authors' Response to AI-based Reviews

Pith reviewed 2026-05-19 20:43 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords AI peer review · author perceptions · large language models · trust in AI · peer review process · computer science conferences · scholarly publishing

The pith

Authors at CS venues found AI-based reviews useful enough to use in revisions, though they trusted them less than human reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are increasingly considered for assisting with scholarly peer review, but data on how authors actually respond to AI feedback is limited. This paper describes two pilot studies at computer science conferences in which authors received AI-generated auxiliary reviews and then completed an anonymous questionnaire. Most of the 56 respondents rated the AI review as useful, said it caught problems missed by human reviewers, and incorporated some of its suggestions into their camera-ready versions. Authors still expressed lower trust in the AI output than in human reviews and preferred that any use of AI come with advance notice and explicit consent. The findings position AI reviews as a supplementary aid rather than a replacement for human judgment.

Core claim

The paper reports that, among the 56 surveyed authors who received both human and AI-based reviews, 83.9% found the AI review useful, 80.4% said it identified issues the human reviewers had not mentioned, and 82.1% used at least some AI feedback in their camera-ready version. Respondents nonetheless trusted the AI review less than the human reviews and preferred that AI be used only with advance notice and consent.

What carries the argument

Anonymous post-review questionnaire with closed-ended items and open-ended responses, summarized by descriptive statistics and inductive thematic analysis.

If this is right

  • 96.4% of authors said they would use AI as an internal review tool before future submissions.
  • 89.3% of authors prefer advance notice that AI will be used in review.
  • 76.8% of authors favor explicit consent before AI is used in the review process.
  • Reported problems with AI reviews were mostly minor: 51.8% of respondents noted minor inaccuracies, while 16.1% reported clearly incorrect, misleading, or irrelevant comments.
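
The headline percentages in the core claim and in the list above can be converted back into respondent counts, which makes the small sample size concrete. The following is a minimal sketch, assuming every percentage uses all 56 analyzable responses as its denominator (the summary does not state per-item denominators, so this is an assumption). The dictionary keys are illustrative labels, not the paper's variable names.

```python
# Sketch: recover respondent counts from the reported percentages.
# ASSUMPTION: every item was answered by all 56 respondents.
N_RESPONDENTS = 56

# Percentages as quoted in the summary; keys are illustrative labels.
reported_pct = {
    "ai_review_useful": 83.9,
    "ai_found_new_issues": 80.4,
    "used_ai_feedback_in_camera_ready": 82.1,
    "would_use_ai_as_internal_tool": 96.4,
    "prefer_advance_notice": 89.3,
    "favor_explicit_consent": 76.8,
}


def implied_count(pct, n=N_RESPONDENTS):
    """Return the integer count k such that k/n rounds to pct (one decimal), else None."""
    k = round(pct / 100 * n)
    # Accept k only if it reproduces the reported percentage within rounding error.
    if abs(100 * k / n - pct) <= 0.05 + 1e-9:
        return k
    return None


counts = {name: implied_count(pct) for name, pct in reported_pct.items()}

for name, k in counts.items():
    print(f"{name}: {k} of {N_RESPONDENTS}")
```

Under that assumption each percentage corresponds to a whole number of respondents, for example 47 of 56 for perceived usefulness, so a shift of one or two answers moves a headline figure by roughly two to four percentage points.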

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If venues adopt AI reviews routinely, authors may begin preparing submissions with an eye toward what automated tools can flag.
  • Hybrid review models could pair AI suggestions with human oversight to address cases where some human reviews were viewed as not very useful.
  • Longer-term studies could check whether incorporating AI feedback actually improves final paper quality or shortens the overall review cycle.

Load-bearing premise

The 56 self-selected responses from authors at two specific computer science venues are representative of broader author perceptions of AI-based reviews.

What would settle it

A larger survey across more venues, using random sampling or at least reporting response rates, would challenge the central claims if it found substantially lower rates of perceived usefulness or of actual use of AI feedback.
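
To gauge how much sampling uncertainty a sample of 56 carries on its own, one can compute a Wilson score interval for a headline proportion. The sketch below assumes 47 of 56 respondents (the count implied by 83.9% if all 56 answered the item) and treats responses as independent. Both are assumptions: the 56 responses come from authors of 40 papers, so co-authors of the same paper may be correlated and the real uncertainty is probably larger. The interval reflects sampling noise only and says nothing about self-selection bias.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: 47 of 56 respondents rated the AI review useful (83.9% of 56).
low, high = wilson_interval(47, 56)
print(f"95% Wilson interval for 47/56: {low:.3f} to {high:.3f}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for small samples and for proportions near 0 or 1, such as the 96.4% figure.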

Original abstract

Large language models are increasingly discussed and used as tools that may assist with scholarly peer review, but empirical evidence regarding how authors use and perceive AI-based feedback remains limited. This paper reports findings from two independent pilot studies on authors' use and perceptions of AI-based auxiliary review at two computer science venues. After the review release, authors were invited to complete an anonymous post-review questionnaire about the AI review's usefulness, trustworthiness, agreement with human reviews, practical value for revision, perceived inaccuracies, and consent. The final dataset included 56 analyzable responses from authors of 40 papers; closed-ended items were summarized using descriptive statistics, and open-ended responses were analyzed using inductive thematic analysis. Most respondents (83.9%) considered the AI-based review useful, and 80.4% reported that it identified issues not mentioned by human reviewers. This perceived added value translated into action: 82.1% reported using at least some AI feedback in their camera-ready version. However, the authors did not treat the AI review as equivalent to a human review. They generally trusted it less than the human reviews and found human feedback clearer, even though 25.0% described at least some human reviews as not very useful. Reported problems with the AI review were usually limited: 51.8% reported minor inaccuracies, while 16.1% reported clearly incorrect, misleading, or irrelevant comments. Support for future use was strongest when AI was framed as a supervised or author-controlled tool: 96.4% said they would use AI as an internal review tool before future submissions, 89.3% preferred advance notice that AI would be used in review, and 76.8% favored explicit consent before use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports findings from two pilot studies surveying authors at two computer science venues on their perceptions and use of AI-based auxiliary reviews. Using 56 anonymous questionnaire responses from authors of 40 papers, it presents descriptive statistics showing that most respondents (83.9%) found the AI review useful, 80.4% said it identified issues not mentioned by human reviewers, and 82.1% used at least some AI feedback in their camera-ready version. Authors trusted AI reviews less than human ones but expressed strong support for AI as a supervised internal tool.

Significance. If the reported perceptions hold, this provides timely empirical data on author attitudes toward AI assistance in peer review, an area with limited prior evidence. The mixed-methods design (descriptive statistics on closed items plus inductive thematic analysis on open responses) offers concrete insights into perceived added value, trust differentials, and preferences for consent and supervision that could inform venue policies.

major comments (2)
  1. [Methods / Data Collection] The manuscript states that the final dataset included 56 analyzable responses from authors of 40 papers but does not report the total number of authors invited or the response rate. This is load-bearing for interpreting the headline proportions (e.g., 83.9% usefulness, 80.4% new issues identified) because self-selection bias cannot be assessed without these figures or a non-response analysis.
  2. [Study Design] No details are provided on how the AI-based reviews were generated or standardized (model, prompt template, length, or calibration across papers). This affects interpretation of the reported inaccuracy rates (51.8% minor, 16.1% clearly incorrect) and the claim that AI identified issues missed by humans.
minor comments (2)
  1. [Abstract] The abstract and results would benefit from explicit mention of the two venues and submission years to allow readers to assess temporal and disciplinary context.
  2. [Discussion] A dedicated limitations paragraph discussing the self-selected sample and lack of power analysis would strengthen the manuscript even if the study is framed as a pilot.
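
The first major comment can be illustrated with worst-case bounds. If the number of invited authors were known, the share of all invited authors who found the AI review useful could be bounded by assuming every non-respondent would have answered "no" (lower bound) or "yes" (upper bound). This is a minimal sketch: the invited totals in the loop are hypothetical placeholders, since the summary does not report them, and 47 is the count implied by 83.9% of 56 under the assumption that all respondents answered the item.

```python
def nonresponse_bounds(positive, respondents, invited):
    """Worst-case bounds on the share of positive answers among all invited authors."""
    if not 0 <= positive <= respondents <= invited or invited == 0:
        raise ValueError("need 0 <= positive <= respondents <= invited and invited > 0")
    nonrespondents = invited - respondents
    lower = positive / invited                     # every non-respondent says "no"
    upper = (positive + nonrespondents) / invited  # every non-respondent says "yes"
    return lower, upper


# HYPOTHETICAL invited totals; the real figure is not given in the summary.
for invited in (56, 80, 120, 200):
    lower, upper = nonresponse_bounds(47, 56, invited)
    print(f"invited={invited}: between {lower:.1%} and {upper:.1%} found the AI review useful")
```

The bounds collapse to a single value only when everyone invited responded and widen quickly as the response rate falls, which is why the referee treats the missing response rate as load-bearing.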

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and transparency of the manuscript. We have prepared a revised version that incorporates additional methodological details as suggested. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [Methods / Data Collection] The manuscript states that the final dataset included 56 analyzable responses from authors of 40 papers but does not report the total number of authors invited or the response rate. This is load-bearing for interpreting the headline proportions (e.g., 83.9% usefulness, 80.4% new issues identified) because self-selection bias cannot be assessed without these figures or a non-response analysis.

    Authors: We agree that the response rate is important for assessing potential self-selection bias. In the revised manuscript we now report the total number of authors invited to the survey at each venue and the resulting overall response rate. We have also added a brief limitations paragraph discussing the implications of the achieved response rate for generalizability in this pilot study.

    revision: yes

  2. Referee: [Study Design] No details are provided on how the AI-based reviews were generated or standardized (model, prompt template, length, or calibration across papers). This affects interpretation of the reported inaccuracy rates (51.8% minor, 16.1% clearly incorrect) and the claim that AI identified issues missed by humans.

    Authors: We appreciate this observation. The revised Methods section now includes a full description of the AI review generation process: the specific model used, the standardized prompt template (reproduced verbatim), target review length, and the calibration steps applied to maintain consistency across papers. These additions provide the context needed to interpret the inaccuracy rates and the authors' perceptions of added value relative to human reviews.

    revision: yes

Circularity Check

0 steps flagged

Empirical survey reporting with no derivations or self-referential models

Full rationale

The paper reports descriptive statistics and thematic analysis from 56 survey responses collected after AI-assisted reviews at two CS venues. All headline percentages (83.9% useful, 80.4% new issues identified, 82.1% used in camera-ready) are direct tabulations of respondent answers rather than outputs of any model, equation, or fitted parameter. No mathematical derivations, predictions, uniqueness theorems, or ansatzes appear in the manuscript. Self-citations, if present, are not load-bearing for the central claims, which rest on the primary data collection described in the methods. The study is therefore self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that self-reported usefulness and usage accurately reflect real behavior and that the small volunteer sample at two CS venues can be interpreted without major selection bias. No free parameters or invented entities appear; the only background assumptions are standard survey validity and that the AI review generation process was consistent enough for comparison.

axioms (2)
  • domain assumption: Self-selected respondents who chose to answer the post-review questionnaire are not systematically different from non-respondents in their views of AI reviews.
    The paper reports 56 analyzable responses but provides no response rate or non-response analysis in the abstract.
  • domain assumption: The AI reviews were generated under comparable conditions across papers so that author perceptions can be aggregated.
    The abstract does not detail the exact AI system, prompt, or review length used.
