pith. machine review for the scientific record.

arxiv: 2604.13940 · v1 · submitted 2026-04-15 · 💻 cs.AI

Recognition: unknown

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI-assisted peer review · peer review quality · scientific review · weakness detection · human-AI teaming · large-scale deployment

The pith

State-of-the-art AI methods can generate peer reviews that authors and program committee members prefer to human reviews on technical accuracy, at conference scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current AI systems can produce reviews that contribute meaningfully to the peer review process even when handling tens of thousands of submissions. The evidence comes from a full deployment in which every main-track submission received a clearly identified AI review generated through a multi-stage process. Surveys of authors and program committee members revealed a preference for the AI reviews over human reviews specifically on technical accuracy and the quality of research suggestions. On a new benchmark designed to test detection of scientific weaknesses, the system also outperformed a simple LLM-generated review baseline. This matters because peer review is under strain from rising submission volumes, and effective AI support could help maintain quality and timeliness.

Core claim

A multi-stage system combining frontier models, tool use, and safeguards generated AI reviews for every main-track submission at the conference. Surveys indicated that authors and program committee members not only found the AI reviews useful but preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. A novel benchmark demonstrated that the system substantially outperforms a simple LLM-generated review baseline at detecting various scientific weaknesses.

What carries the argument

The multi-stage AI review generation system that uses frontier models with tool use and safeguards to create reviews for all submissions.
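The deployed pipeline's code is not reproduced on this page, so the following is only a minimal sketch of what a staged "frontier model + tool use + safeguards" review generator could look like; the stage names, data structures, and the call_model placeholder are illustrative assumptions, not the AAAI-26 system.

```python
# Minimal sketch of a staged AI-review pipeline (illustrative only).
# The stages, prompts, and safeguard checks are assumptions, not the
# AAAI-26 system; `call_model` stands in for whatever frontier-model
# API the deployment actually used.
from dataclasses import dataclass, field

@dataclass
class ReviewDraft:
    paper_id: str
    summary: str = ""
    weaknesses: list = field(default_factory=list)
    flags: list = field(default_factory=list)   # safeguard findings

def call_model(prompt: str) -> str:
    """Placeholder for a frontier-model call (assumed interface)."""
    raise NotImplementedError

def stage_summarize(paper_text: str, draft: ReviewDraft) -> ReviewDraft:
    draft.summary = call_model(f"Summarize the contribution:\n{paper_text}")
    return draft

def stage_find_weaknesses(paper_text: str, draft: ReviewDraft) -> ReviewDraft:
    # Tool use (e.g., checking citations or re-running a derivation)
    # would be invoked from inside this stage in a real pipeline.
    draft.weaknesses = call_model(
        f"List concrete technical weaknesses with evidence:\n{paper_text}"
    ).splitlines()
    return draft

def stage_safeguards(draft: ReviewDraft) -> ReviewDraft:
    # Example safeguard: route sensitive accusations to a human check
    # instead of emitting them directly.
    for w in draft.weaknesses:
        if "plagiarism" in w.lower():
            draft.flags.append("needs human confirmation: " + w)
    return draft

def generate_review(paper_id: str, paper_text: str) -> ReviewDraft:
    draft = ReviewDraft(paper_id)
    for stage in (stage_summarize, stage_find_weaknesses):
        draft = stage(paper_text, draft)
    return stage_safeguards(draft)
```

The point of the staging is that safeguards operate on a structured draft rather than on free text, which is one plausible way to keep unsupported claims out of a review labeled as AI-generated.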

Load-bearing premise

Survey responses from authors and program committee members reflect genuine review quality rather than being skewed by novelty effects or other unmeasured biases.

What would settle it

A follow-up experiment in which independent experts, blind to the source, rate paired AI and human reviews on the same set of papers for technical soundness, completeness, and helpfulness, with results showing no advantage for AI.
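One concrete way to run and score that experiment is to collect blinded ratings of paired AI and human reviews for the same papers and test the paired differences. The sketch below is a hypothetical analysis, assuming a 1–5 rating scale and using SciPy's Wilcoxon signed-rank test; it is not a protocol taken from the paper.

```python
# Sketch of analyzing blinded, paired ratings of AI vs. human reviews.
# Ratings are assumed to be on a 1-5 scale, one pair per paper; the
# Wilcoxon signed-rank test handles the paired, ordinal data.
from scipy.stats import wilcoxon

def compare_paired_ratings(ai_scores, human_scores, alpha=0.05):
    assert len(ai_scores) == len(human_scores)
    stat, p_value = wilcoxon(ai_scores, human_scores)
    mean_diff = sum(a - h for a, h in zip(ai_scores, human_scores)) / len(ai_scores)
    return {
        "mean_difference": mean_diff,   # > 0 favors the AI reviews
        "p_value": p_value,
        "significant": p_value < alpha,
    }

# Made-up ratings for ten papers, purely to show the call.
ai = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4]
human = [3, 4, 4, 4, 3, 3, 5, 4, 3, 3]
print(compare_paired_ratings(ai, human))
```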

Figures

Figures reproduced from arXiv: 2604.13940 by Anthony Opipari, Arthur Zhang, Gautham Vasan, Joydeep Biswas, Junyi Jessy Li, Kiri L. Wagstaff, Matthew E. Taylor, Matthew Lease, Odest Chadwicke Jenkins, Peter Stone, Sebastian Joseph, Sheila Schoepp, Zichao Hu.

Figure 1: The AAAI-26 AI review system (a) and review generation timeline (b).

Figure 2: Survey responses: AI vs. human review comparisons (a) and AI review questions (b). The left panel shows the differences in the mean response score between AI and human reviews for each of the nine review-quality criteria. In six out of nine criteria, AI reviews were rated higher than human reviews. The preference towards AI reviews was stronger for authors than for PC, SPC, and ACs.

Figure 3: Top five most frequent positive and negative themes specific to the AAAI-26 AI Review Pilot found in written …

Figure 4: The SPECS Review Benchmark curation and analysis workflow (a) and stage-by-criterion detection rates (b).
Original abstract

Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.
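A back-of-envelope reading of the abstract's scale claim: 22,977 reviews in under 24 hours averages roughly 16 reviews per minute, so the pipeline must run many reviews concurrently. The per-review latency below is an assumption for illustration, not a number from the paper.

```python
# Back-of-envelope throughput for the reported deployment.
papers = 22_977                     # from the abstract
wall_clock_hours = 24               # "less than a day"
assumed_minutes_per_review = 10     # assumption, not from the paper

reviews_per_minute = papers / (wall_clock_hours * 60)
concurrent_reviews = reviews_per_minute * assumed_minutes_per_review

print(f"{reviews_per_minute:.1f} reviews/minute on average")
print(f"~{concurrent_reviews:.0f} reviews in flight at any moment "
      f"if each takes {assumed_minutes_per_review} minutes")
```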

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports the first large-scale deployment of an AI peer-review system at AAAI-26, generating clearly labeled AI reviews for all 22,977 main-track submissions in under a day using frontier models, tool use, and safeguards. It presents survey results from authors and PC members indicating preference for AI reviews over human ones on technical accuracy and research suggestions, introduces a novel benchmark where the system outperforms a simple LLM-generated review baseline at detecting scientific weaknesses, and concludes that state-of-the-art AI can already make meaningful contributions to peer review at conference scale.

Significance. If the empirical results hold, this is a significant contribution as the first reported real-world, conference-scale field test of AI-assisted review. The deployment scale (nearly 23k papers) and dual evidence from survey plus benchmark provide concrete data on feasibility. Credit is due for the practical engineering of the multi-stage pipeline and for releasing a new benchmark for review quality assessment.

major comments (3)
  1. [Survey results section] Response rates, sampling frame, and any statistical tests comparing AI vs. human reviews on accuracy/suggestions are not reported. This is load-bearing for the central claim, as the reported preference cannot be interpreted without these details (potential self-selection or novelty bias unmeasured).
  2. [Benchmark section] The set of scientific weaknesses is constructed internally without external validation against documented real-world review failures or direct comparison to human reviewer detection rates. This undermines the claim of superiority over the simple LLM baseline for practical review utility.
  3. [Abstract and evaluation sections] The survey explicitly labels AI reviews as such, yet no controls or measurements for social-desirability or positivity bias are described, leaving open whether preferences reflect objective quality or labeling effects.
minor comments (2)
  1. [Benchmark section] Clarify the exact composition of the 'simple LLM-generated review baseline' (prompting details, model version) to allow replication.
  2. [System description] The multi-stage pipeline description would benefit from a diagram or pseudocode to illustrate the safeguards and tool-use steps.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments and for acknowledging the significance of this large-scale deployment. We provide point-by-point responses to the major comments below and indicate the revisions we will make to the manuscript.

point-by-point responses
  1. Referee: [Survey results section] Response rates, sampling frame, and any statistical tests comparing AI vs. human reviews on accuracy/suggestions are not reported. This is load-bearing for the central claim, as the reported preference cannot be interpreted without these details (potential self-selection or novelty bias unmeasured).

    Authors: We agree that these details are necessary to fully interpret the survey results. In the revised manuscript, we will add the response rates and sampling frame details (all authors and PC members were invited to the survey). We will also report the statistical tests used to compare preferences between AI and human reviews on the dimensions of accuracy and suggestions. Furthermore, we will expand the limitations section to discuss potential self-selection and novelty biases. revision: yes

  2. Referee: [Benchmark section] The set of scientific weaknesses is constructed internally without external validation against documented real-world review failures or direct comparison to human reviewer detection rates. This undermines the claim of superiority over the simple LLM baseline for practical review utility.

    Authors: The benchmark provides a standardized way to evaluate the AI system's performance on detecting predefined scientific weaknesses, and our claim is specifically that it outperforms the simple LLM baseline on this benchmark. We will revise the section to provide more detail on the construction of the weakness categories, drawing from common issues in peer review. We will also add an explicit discussion of the limitations, including the internal construction and lack of direct comparison to human reviewer performance, as we do not have such paired data available. revision: partial

  3. Referee: [Abstract and evaluation sections] The survey explicitly labels AI reviews as such, yet no controls or measurements for social-desirability or positivity bias are described, leaving open whether preferences reflect objective quality or labeling effects.

    Authors: We acknowledge the potential for labeling effects in the survey design. The revised manuscript will include additional text in the evaluation section discussing this possible bias and its implications for interpreting the preference results. We note that while no blinded control was implemented, the survey was conducted after the reviews were provided, and preferences were consistent across different groups of respondents. revision: yes

standing simulated objections not resolved
  • Direct comparison to human reviewer detection rates on the benchmark, as this would require a separate study with human reviewers evaluating the same set of papers for the defined weaknesses.
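
To make that unresolved comparison concrete, the sketch below shows one way per-criterion detection rates could be computed for an AI system and human reviewers on the same benchmark items; the item format and the notion of "detected" are assumptions for illustration, not the SPECS benchmark's actual scoring code.

```python
# Sketch: per-criterion detection rates for two review sources on the
# same papers, given benchmark items that each seed one known weakness.
from collections import defaultdict

def detection_rates(items, detected):
    """items: list of dicts with 'paper_id' and 'criterion';
    detected: dict mapping (paper_id, criterion) -> bool."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        key = (item["paper_id"], item["criterion"])
        totals[item["criterion"]] += 1
        hits[item["criterion"]] += int(detected.get(key, False))
    return {c: hits[c] / totals[c] for c in totals}

# Illustrative comparison on three seeded weaknesses (made-up data).
items = [
    {"paper_id": "p1", "criterion": "missing baseline"},
    {"paper_id": "p2", "criterion": "statistical error"},
    {"paper_id": "p3", "criterion": "missing baseline"},
]
ai_found = {("p1", "missing baseline"): True,
            ("p2", "statistical error"): True,
            ("p3", "missing baseline"): False}
human_found = {("p1", "missing baseline"): True,
               ("p2", "statistical error"): False,
               ("p3", "missing baseline"): True}
print("AI   :", detection_rates(items, ai_found))
print("Human:", detection_rates(items, human_found))
```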

Circularity Check

0 steps flagged

No significant circularity: empirical deployment report grounded in external data collection

full rationale

The paper presents results from a real-world deployment of AI-generated reviews for all AAAI-26 submissions, followed by surveys of authors and PC members plus a new benchmark for detecting scientific weaknesses. No mathematical derivation chain, equations, fitted parameters, or self-referential definitions exist. Central claims rest on collected survey responses and benchmark performance against an external baseline, with no load-bearing steps that reduce by construction to the paper's own inputs or prior self-citations. This is a standard empirical field study whose validity hinges on data quality rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of self-reported survey preferences as a proxy for review quality and on the benchmark being a faithful test of real scientific weaknesses; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Survey responses from authors and program committee members provide an unbiased measure of review usefulness and technical accuracy
    The preference for AI reviews is presented as evidence of quality; this rests on the assumption that participants' ratings reflect objective merit rather than novelty or other biases.

pith-pipeline@v0.9.0 · 5568 in / 1212 out tokens · 50602 ms · 2026-05-10T12:55:05.779336+00:00 · methodology

