Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
Pith reviewed 2026-05-24 02:46 UTC · model grok-4.3
The pith
A maximum likelihood model estimates that between 6.5 and 16.9 percent of the text in peer reviews at four recent AI conferences could have been substantially modified by large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a maximum likelihood estimator, calibrated only on selected expert-written and AI-generated reference texts, can recover the fraction of substantially LLM-modified text in a large unlabeled corpus. Applied to the four conference review sets, it yields the 6.5–16.9 percent interval and further indicates that this fraction correlates with reviewer confidence, submission timing, and rebuttal response rate.
What carries the argument
A maximum likelihood model that treats each review as a mixture of expert-written and LLM-generated text and fits the mixing proportion by comparing the observed text to the two reference corpora.
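The mixture idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each document already has a log-probability under a human reference model and under an LLM reference model (the names `logp_human`, `logp_ai`, `log_likelihood`, and `estimate_alpha` are hypothetical), and it fits a single mixing proportion `alpha` by maximizing the corpus log-likelihood. The toy numbers are invented.

```python
import math


def log_likelihood(alpha, logp_human, logp_ai):
    """Log-likelihood of the corpus under a two-component mixture."""
    total = 0.0
    for lp, lq in zip(logp_human, logp_ai):
        # log((1 - alpha) * exp(lp) + alpha * exp(lq)), shifted by the max
        # of the two log-probabilities for numerical stability.
        m = max(lp, lq)
        total += m + math.log(
            (1.0 - alpha) * math.exp(lp - m) + alpha * math.exp(lq - m)
        )
    return total


def estimate_alpha(logp_human, logp_ai, tol=1e-6):
    """Ternary search for the alpha in (0, 1) maximizing the log-likelihood.

    The objective is a sum of logs of functions affine in alpha, hence
    concave, so ternary search converges to the maximizer.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, logp_human, logp_ai) < log_likelihood(
            m2, logp_human, logp_ai
        ):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


# Toy inputs (invented for illustration): 80 documents that look
# human-written and 20 that look LLM-generated under the reference models.
logp_human = [-10.0] * 80 + [-30.0] * 20
logp_ai = [-30.0] * 80 + [-10.0] * 20
alpha_hat = estimate_alpha(logp_human, logp_ai)
print(f"estimated LLM fraction: {alpha_hat:.3f}")
```

Because the estimate depends entirely on how well the two reference models describe the target corpus, any mismatch between reference and target text feeds directly into `alpha_hat`, which is the concern raised under the load-bearing premise.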
If this is right
- Reviews reporting lower confidence contain a higher estimated share of LLM text.
- Reviews submitted close to the deadline contain a higher estimated share of LLM text.
- Reviewers who respond less often to author rebuttals show higher estimated LLM fractions.
- Corpus-wide statistical patterns in generated text exist that are invisible at the level of any individual review.
Where Pith is reading between the lines
- The same estimation procedure could be rerun on later conference cycles to measure whether the LLM-modified fraction grows or shrinks.
- If the observed correlations hold, conferences might consider deadline policies or confidence prompts as levers that affect LLM adoption.
- The method supplies a scalable way to track LLM influence on other large text collections such as grant reviews or journal submissions.
Load-bearing premise
The reference texts chosen for calibration produce unbiased estimates when the same model is applied to real peer-review writing.
What would settle it
A collection of peer reviews whose authors confirm zero LLM use that the model nonetheless assigns a high LLM-modified fraction, or the reverse pattern on confirmed LLM-heavy reviews.
Original abstract
We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a maximum likelihood estimation (MLE) framework that uses separate reference corpora of expert-written and LLM-generated texts to estimate the fraction of substantially LLM-modified content in a target corpus. Applied to peer reviews from ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023, the model yields an estimated range of 6.5–16.9% LLM-modified text (beyond minor edits). The work also reports correlations between higher estimated LLM use and lower reviewer confidence, later submission times, and lower rebuttal response rates, and discusses broader implications for peer review.
Significance. If the calibration references are representative of the target domain, the corpus-level estimates and behavioral correlations would provide the first large-scale quantitative evidence on LLM adoption in AI conference reviewing and could inform policy on disclosure and detection. The method's efficiency for corpus-scale analysis is a practical strength, but its validity rests on untested assumptions about reference-text similarity.
major comments (1)
- [§3 and §4] §3 (Methods) and §4 (Results): The MLE is calibrated exclusively on the chosen expert-written and AI-generated reference texts, yet no quantitative validation (e.g., feature-distribution overlap, perplexity, or domain-adaptation metrics) is reported to confirm that these references share the same vocabulary, syntactic, and technical-density distributions as the peer-review corpus. Systematic mismatch would bias the 6.5–16.9% fraction, directly undermining the central claim.
minor comments (2)
- [Abstract] The abstract and introduction should explicitly state the exact definition of “substantially modified” used in the MLE threshold.
- [Figures 3–5] Figure captions and axis labels in the behavioral-correlation plots should include sample sizes and confidence intervals for each bin.
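One way to obtain the per-bin confidence intervals requested above is a percentile bootstrap over the reviews in each bin. The sketch below is an assumption about how this could be done, not a description of the paper's procedure; `bootstrap_ci` is a hypothetical helper, the sample mean stands in for whatever corpus-level estimator is actually used, and the bin data are invented.

```python
import random


def bootstrap_ci(values, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Recompute the statistic on n_resamples resamples drawn with replacement.
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * n_resamples)
    hi_idx = int((1.0 - tail) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


def mean(xs):
    return sum(xs) / len(xs)


# Invented per-bin data: 1 marks a review flagged by some stand-in signal.
bins = {
    "low confidence": [1] * 30 + [0] * 70,
    "high confidence": [1] * 10 + [0] * 90,
}
for name, values in bins.items():
    low, high = bootstrap_ci(values, mean)
    print(f"{name}: n={len(values)}, estimate={mean(values):.2f}, "
          f"95% CI=({low:.2f}, {high:.2f})")
```

Reporting `n` alongside each interval, as in the printed line, covers both items the comment asks for.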
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below and commit to revisions where appropriate.
- Referee: [§3 and §4] §3 (Methods) and §4 (Results): The MLE is calibrated exclusively on the chosen expert-written and AI-generated reference texts, yet no quantitative validation (e.g., feature-distribution overlap, perplexity, or domain-adaptation metrics) is reported to confirm that these references share the same vocabulary, syntactic, and technical-density distributions as the peer-review corpus. Systematic mismatch would bias the 6.5–16.9% fraction, directly undermining the central claim.
Authors: We agree that explicit quantitative validation of reference-text similarity to the target corpus would strengthen the paper and reduce concerns about potential bias in the MLE estimates. The references were chosen for domain relevance (prior conference reviews and LLM outputs prompted to produce review-style text), but we did not include overlap metrics in the original submission. In the revised manuscript we will add: (1) vocabulary overlap via Jaccard similarity on unigrams and bigrams, (2) perplexity of both reference sets under a held-out RoBERTa model, and (3) distributions of syntactic and technical features (sentence length, POS-tag entropy, and density of AI-conference terms). These will appear in an expanded §3 with a new discussion of any observed mismatches and their possible effect on the 6.5–16.9% range. We view this as a substantive improvement rather than a change to the core method or results.
revision: yes
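The first proposed check, Jaccard similarity on unigram and bigram vocabularies, could look roughly like the following. This is a sketch under simple assumptions (lowercased whitespace tokenization, set-level overlap); the helper names `ngrams`, `jaccard`, and `vocab_overlap` and the two toy corpora are hypothetical and do not come from the paper.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def vocab_overlap(corpus_a, corpus_b, n):
    """Jaccard similarity between the n-gram vocabularies of two corpora."""
    # Assumption: lowercased whitespace tokenization is good enough here.
    vocab_a = set().union(*(ngrams(doc.lower().split(), n) for doc in corpus_a))
    vocab_b = set().union(*(ngrams(doc.lower().split(), n) for doc in corpus_b))
    return jaccard(vocab_a, vocab_b)


# Toy corpora, invented for illustration.
reference = ["The paper is clear"]
target = ["The paper is novel"]
print("unigram overlap:", vocab_overlap(reference, target, 1))
print("bigram overlap:", vocab_overlap(reference, target, 2))
```

A low overlap between a reference set and the review corpus would be a warning sign for the mismatch the referee describes, though set-level overlap ignores frequency, so it complements rather than replaces the distributional checks in items (2) and (3).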
Circularity Check
No significant circularity detected
The paper's core estimation uses a maximum likelihood model calibrated exclusively on separate expert-written and AI-generated reference corpora, then applies the fitted model to an independent target corpus of peer reviews. This structure does not reduce the target estimates to the calibration inputs by construction, nor does the abstract or described method invoke self-citations, uniqueness theorems, or ansatzes that would make the result tautological. The derivation remains externally grounded in the reference data and is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert-written and AI-generated reference texts are sufficiently representative to calibrate the model for real peer-review text.
Forward citations
Cited by 9 Pith papers
- AgentReview: Exploring Peer Review Dynamics with LLM Agents
  AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.
- PeerPrism: Peer Evaluation Expertise vs Review-writing AI
  PeerPrism benchmark demonstrates that state-of-the-art LLM detectors conflate surface text style with intellectual contribution and fail on hybrid human-AI peer reviews.
- The Impact of AI-Generated Text on the Internet
  By mid-2025 roughly 35% of new websites are AI-generated or AI-assisted, correlating with lower semantic diversity and higher positive sentiment but showing no significant drop in factual accuracy or stylistic diversity.
- Detecting Verbatim LLM Copy-Paste in Homework
  SteganoPrompt embeds a hidden instruction in assignment prompts via the Unicode Tags block so that LLMs add a detectable signature to responses when the prompt is pasted verbatim.
- Rethinking Publication: A Certification Framework for AI-Enabled Research
  The paper introduces a certification framework that grades AI research contributions into Categories A, B, and C based on pipeline reach at submission time and adds benchmark slots for fully automated work.
- Rethinking Publication: A Certification Framework for AI-Enabled Research
  A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.
- AI Disclosure with DAISY
  DAISY is a structured form tool that generates more complete AI disclosure statements for research papers without reducing author comfort levels.
- Publish and Perish: How AI-Accelerated Writing Without Proportional Verification Investment Degrades Scientific Knowledge
  A minimal ODE model of AI adoption in writing and reviewing predicts a short-term knowledge peak followed by 40% long-term decline unless review acceleration exceeds writing acceleration.
- Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews
  Controlled prompt interventions reveal strong affiliation bias in LLM peer reviews favoring top-ranked institutions, plus effects from seniority and publication history.