arXiv: 2403.07183 · v4 · pith:JLKT2QTNnew · submitted 2024-03-11 · 💻 cs.CL · cs.AI · cs.LG · cs.SI

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

Pith reviewed 2026-05-24 02:46 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.SI
keywords LLM-generated text · peer review · AI conferences · maximum likelihood estimation · content monitoring · ChatGPT impact · scientific publishing

The pith

A maximum likelihood model estimates that 6.5 to 16.9 percent of text in recent AI conference peer reviews was substantially modified by large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a statistical method that uses reference sets of expert-written and AI-generated text to estimate the share of a corpus likely produced or heavily altered by LLMs. Applied to peer reviews from ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023, the model returns a range of 6.5 to 16.9 percent of review text showing substantial LLM involvement beyond minor edits. The estimated share is higher in reviews that express lower reviewer confidence, that are submitted close to the review deadline, and that come from reviewers less likely to reply to author rebuttals. The approach also surfaces aggregate patterns across the corpus that would be difficult to spot in any single review. These measurements are offered as evidence that LLM use is already altering the daily practice of scientific evaluation.

Core claim

The central claim is that a maximum likelihood estimator, trained only on chosen expert-written and AI-generated reference texts, can recover the fraction of substantially LLM-modified text in a large unlabeled corpus; when run on the four conference review sets it yields the 6.5–16.9 percent interval and further shows that this fraction correlates with reviewer confidence, submission timing, and rebuttal response rate.

What carries the argument

A maximum likelihood model that treats the target corpus as a mixture of expert-written and LLM-generated text and fits the mixing proportion α by comparing the observed text to the distributions P and Q estimated from the two reference corpora.
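
A minimal sketch of such a mixture fit, under stated assumptions rather than as the authors' implementation: each document x is scored under an expert-written distribution P and an LLM distribution Q, and the mixing fraction α is chosen to maximize the sum of log((1 − α)·P(x) + α·Q(x)). The word lists, occurrence probabilities, grid search, and function names below are all hypothetical toy choices made for illustration.

```python
import math
import random


def mixture_log_likelihood(alpha, log_p, log_q):
    """Sum over documents of log((1 - alpha) * P(x) + alpha * Q(x))."""
    total = 0.0
    for lp, lq in zip(log_p, log_q):
        if alpha <= 0.0:
            total += lp
        elif alpha >= 1.0:
            total += lq
        else:
            # Numerically stable log-sum-exp of the two weighted components.
            a = math.log1p(-alpha) + lp
            b = math.log(alpha) + lq
            m = max(a, b)
            total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def estimate_alpha(log_p, log_q, grid_size=201):
    """Grid-search maximum likelihood estimate of alpha on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda a: mixture_log_likelihood(a, log_p, log_q))


# Toy per-document occurrence probabilities (invented, not from the paper).
P_OCC = {"w1": 0.02, "w2": 0.02, "w3": 0.02, "w4": 0.30, "w5": 0.30, "w6": 0.30}
Q_OCC = {"w1": 0.30, "w2": 0.30, "w3": 0.30, "w4": 0.02, "w5": 0.02, "w6": 0.02}


def sample_doc(occ, rng):
    """Draw the set of marker words that occur in one synthetic document."""
    return {w for w, p in occ.items() if rng.random() < p}


def doc_log_prob(doc, occ):
    """Log-probability of a document under independent word occurrences."""
    return sum(math.log(p if w in doc else 1.0 - p) for w, p in occ.items())


def simulate_and_estimate(true_alpha=0.15, n_docs=3000, seed=0):
    """Build a synthetic corpus with a known alpha and try to recover it."""
    rng = random.Random(seed)
    docs = [
        sample_doc(Q_OCC if rng.random() < true_alpha else P_OCC, rng)
        for _ in range(n_docs)
    ]
    log_p = [doc_log_prob(d, P_OCC) for d in docs]
    log_q = [doc_log_prob(d, Q_OCC) for d in docs]
    return estimate_alpha(log_p, log_q)


if __name__ == "__main__":
    print(simulate_and_estimate())
```

The log-likelihood is concave in α, so a coarse grid is enough for a sketch; a real analysis would estimate P and Q from the reference corpora instead of fixing them by hand.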

If this is right

  • Reviews reporting lower confidence contain a higher estimated share of LLM text.
  • Reviews submitted close to the deadline contain a higher estimated share of LLM text.
  • Reviewers who respond less often to author rebuttals show higher estimated LLM fractions.
  • Corpus-wide statistical patterns in generated text exist that are invisible at the level of any individual review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same estimation procedure could be rerun on later conference cycles to measure whether the LLM-modified fraction grows or shrinks.
  • If the observed correlations hold, conferences might consider deadline policies or confidence prompts as levers that affect LLM adoption.
  • The method supplies a scalable way to track LLM influence on other large text collections such as grant reviews or journal submissions.

Load-bearing premise

The reference texts chosen for calibration produce unbiased estimates when the same model is applied to real peer-review writing.

What would settle it

A collection of peer reviews whose authors confirm zero LLM use that the model nonetheless assigns a high LLM-modified fraction, or the reverse pattern on confirmed LLM-heavy reviews.

Figures

Figures reproduced from arXiv: 2403.07183 by Daniel A. McFarland, Haley Lepp, Hancheng Cao, Haotian Ye, James Y. Zou, Lingjiao Chen, Sheng Liu, Weixin Liang, Xuandong Zhao, Yaohui Zhang, Zachary Izzo, Zhi Huang.

Figure 1. Shift in Adjective Frequency in ICLR 2024 Peer Reviews. We find a significant shift in the frequency of certain tokens in ICLR 2024, with adjectives such as “commendable”, “meticulous”, and “intricate” showing 9.8, 34.7, and 11.2-fold increases in probability of occurring in a sentence. We find a similar trend in NeurIPS but not in Nature Portfolio journals. Supp…
Figure 2. An overview of the method. We begin by generating a corpus of documents with known scientist or AI authorship. Using this historical data, we can estimate the scientist-written and AI text distributions P and Q and validate our method’s performance on held-out data. Finally, we can use the estimated P and Q to estimate the fraction of AI-generated text in a target corpus.
Figure 3. Performance validation of our MLE estimator across ICLR ’23, NeurIPS ’22, and CoRL ’22 reviews (all predating ChatGPT’s launch) via the method described in Section 3.6. Our algorithm demonstrates high accuracy with less than 2.4% prediction error in identifying the proportion of LLM-generated feedback within the validation set. See Supp…
Figure 4. Temporal changes in the estimated α for several ML conferences and Nature Portfolio journals. The estimated α for all ML conferences increases sharply after the release of ChatGPT (denoted by the dotted vertical line), indicating that LLMs are being used in a small but significant way. Conversely, the α estimates for Nature Portfolio reviews do not exhibit a significant increase or rise above the margin …
Figure 5. Robustness of the estimations to proofreading. Evaluating α after using LLMs for “proof-reading” (non-substantial editing) of peer reviews shows a minor, non-significant increase across conferences, confirming our method’s sensitivity to text which was generated in significant part by LLMs, beyond simple proofreading. See Supp…
Figure 6. Substantial modification and expansion of incomplete sentences using LLMs can largely account for the observed trend. Rather than directly using LLMs to generate feedback, we expand a bullet-pointed skeleton of incomplete sentences into a full review using LLMs (see Supp…
Figure 7. The deadline effect. Reviews submitted within 3 days of the review deadline tended to have a higher estimated α. See Supp…
Figure 9. The lower reply rate effect. We observe a negative correlation between number of reviewer replies in the review discussion period and the estimated α on these reviews. See Supp…
Figure 10. The homogenization effect. “Convergent” reviews (those most similar to other reviews of the same paper in the embedding space) tend to have a higher estimated α as compared to “divergent” reviews (those most dissimilar to other reviews). See Supp…
Figure 11. The low confidence effect. Reviews with low confidence, defined as self-rated confidence of 2 or lower on a 5-point scale, are correlated with higher alpha values than those with 3 or above, and are mostly identical across these major ML conferences. See the descriptions of the confidence rating scales in Supp…
Figure 12. Word cloud of top 100 adjectives in LLM feedback, with font size indicating frequency.
Figure 13. Word cloud of top 100 adverbs in LLM feedback, with font size indicating frequency.
Figure 14. Full Results of the validation procedure from Section…
Figure 15. Results of the validation procedure from Section…
Figure 16. Results of the validation procedure from Section…
Figure 17. Temporal changes in the estimated α for several ML conferences using adverbs.
Figure 18. Results of the validation procedure from Section…
Figure 19. Temporal changes in the estimated α for several ML conferences using verbs
Figure 20. Results of the validation procedure from Section…
Figure 21. Temporal changes in the estimated α for several ML conferences using nouns
Figure 22. Results of the validation procedure from Section…
Figure 23.
Figure 24. Results of the validation procedure from Section…
Figure 25. Results of the validation procedure from Section…
Figure 26. Temporal changes in the estimated α for several ML conferences using the model trained on reviews generated by GPT-3.5.
Figure 27. Example system prompt for generating training data. Paper contents are provided as the user message.
Figure 28. Example prompt for generating validation data with prompt shift. Note that although this validation prompt is…
Figure 29. Example prompt for reverse-engineering a given official review into a skeleton (outline) to simulate how a human…
Figure 30. Example prompt for elaborating the skeleton (outline) into the full review. The format of a review varies…
Figure 31. Example prompt for proofreading.
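
The fold increases quoted in the Figure 1 caption are ratios of the probability that a word occurs in a sentence, after versus before. A small sketch of that arithmetic follows, assuming a simple case-insensitive whole-word match; the function names and the toy sentences are hypothetical and are not taken from the paper's data or code.

```python
import math
import re


def occurrence_probability(sentences, word):
    """Fraction of sentences containing `word` at least once."""
    if not sentences:
        raise ValueError("need at least one sentence")
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = sum(1 for s in sentences if pattern.search(s))
    return hits / len(sentences)


def fold_change(before, after, word):
    """Ratio of per-sentence occurrence probability, after over before."""
    p_before = occurrence_probability(before, word)
    p_after = occurrence_probability(after, word)
    if p_before == 0.0:
        # The ratio is undefined when the word never appeared before.
        return math.inf if p_after > 0.0 else math.nan
    return p_after / p_before


# Toy sentences, invented for illustration only.
before = [
    "The proof is intricate but correct.",
    "The experiments are limited.",
    "Related work is missing.",
    "The writing is clear.",
]
after = [
    "The intricate design is well motivated.",
    "An intricate analysis supports the claim.",
    "The baselines are weak.",
    "The ablations are thorough.",
]

if __name__ == "__main__":
    print(fold_change(before, after, "intricate"))
```

Counting at the sentence level rather than the token level means a word repeated within one sentence is counted once, which matches the "probability of occurring in a sentence" wording of the caption.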
Original abstract

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a maximum likelihood estimation (MLE) framework that uses separate reference corpora of expert-written and LLM-generated texts to estimate the fraction of substantially LLM-modified content in a target corpus. Applied to peer reviews from ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023, the model yields an estimated range of 6.5–16.9% LLM-modified text (beyond minor edits). The work also reports correlations between higher estimated LLM use and lower reviewer confidence, later submission times, and lower rebuttal response rates, and discusses broader implications for peer review.

Significance. If the calibration references are representative of the target domain, the corpus-level estimates and behavioral correlations would provide the first large-scale quantitative evidence on LLM adoption in AI conference reviewing and could inform policy on disclosure and detection. The method's efficiency for corpus-scale analysis is a practical strength, but its validity rests on untested assumptions about reference-text similarity.

major comments (1)
  1. [§3 and §4] §3 (Methods) and §4 (Results): The MLE is calibrated exclusively on the chosen expert-written and AI-generated reference texts, yet no quantitative validation (e.g., feature-distribution overlap, perplexity, or domain-adaptation metrics) is reported to confirm that these references share the same vocabulary, syntactic, and technical-density distributions as the peer-review corpus. Systematic mismatch would bias the 6.5–16.9% fraction, directly undermining the central claim.
minor comments (2)
  1. [Abstract] The abstract and introduction should explicitly state the exact definition of “substantially modified” used in the MLE threshold.
  2. [Figures 7, 9–11] Figure captions and axis labels in the behavioral-correlation plots should include sample sizes and confidence intervals for each bin.
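
One generic way to report the per-bin sample sizes and confidence intervals requested here is a percentile bootstrap. The sketch below is an illustration under assumptions, not the paper's procedure: `bootstrap_ci` and its parameters are hypothetical names, and the statistic shown is a plain mean, where an actual analysis would plug in the α estimator applied to the resampled reviews of one bin.

```python
import random
import statistics


def bootstrap_ci(values, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for `statistic` over `values`.

    Returns (sample_size, lower, upper).
    """
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on resamples drawn with replacement.
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 + level) / 2.0 * n_resamples))
    return n, stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Toy per-review scores for one bin, invented for illustration.
    scores = [float(i) for i in range(100)]
    print(bootstrap_ci(scores, statistics.mean))
```

Reporting the sample size alongside the interval lets a reader judge whether a wide interval reflects a small bin or genuine variability.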

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below and commit to revisions where appropriate.

point-by-point responses
  1. Referee: [§3 and §4] §3 (Methods) and §4 (Results): The MLE is calibrated exclusively on the chosen expert-written and AI-generated reference texts, yet no quantitative validation (e.g., feature-distribution overlap, perplexity, or domain-adaptation metrics) is reported to confirm that these references share the same vocabulary, syntactic, and technical-density distributions as the peer-review corpus. Systematic mismatch would bias the 6.5–16.9% fraction, directly undermining the central claim.

    Authors: We agree that explicit quantitative validation of reference-text similarity to the target corpus would strengthen the paper and reduce concerns about potential bias in the MLE estimates. The references were chosen for domain relevance (prior conference reviews and LLM outputs prompted to produce review-style text), but we did not include overlap metrics in the original submission. In the revised manuscript we will add: (1) vocabulary overlap via Jaccard similarity on unigrams and bigrams, (2) perplexity of both reference sets under a held-out RoBERTa model, and (3) distributions of syntactic and technical features (sentence length, POS-tag entropy, and density of AI-conference terms). These will appear in an expanded §3 with a new discussion of any observed mismatches and their possible effect on the 6.5–16.9% range. We view this as a substantive improvement rather than a change to the core method or results.

    revision: yes
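
The first proposed check, vocabulary overlap via Jaccard similarity on unigrams and bigrams, could be sketched as follows. This is a minimal illustration under assumptions: the tokenizer is a simple lowercase regex, and the function names and toy corpora are hypothetical rather than anything specified in the rebuttal.

```python
import re


def ngrams(text, n):
    """Set of word n-grams in `text` under a simple lowercase tokenizer."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def corpus_ngram_jaccard(corpus_a, corpus_b, n):
    """Jaccard similarity between the n-gram vocabularies of two corpora."""
    vocab_a = set().union(*(ngrams(doc, n) for doc in corpus_a))
    vocab_b = set().union(*(ngrams(doc, n) for doc in corpus_b))
    return jaccard(vocab_a, vocab_b)


# Toy corpora, invented for illustration only.
reference_reviews = ["The method is novel"]
target_reviews = ["The method is unclear"]

if __name__ == "__main__":
    print(corpus_ngram_jaccard(reference_reviews, target_reviews, 1))
    print(corpus_ngram_jaccard(reference_reviews, target_reviews, 2))
```

Set-based Jaccard ignores how often each n-gram occurs, so it measures shared vocabulary rather than matching frequency distributions; a frequency-weighted comparison would be a separate check.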

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core estimation uses a maximum likelihood model calibrated exclusively on separate expert-written and AI-generated reference corpora, then applies the fitted model to an independent target corpus of peer reviews. This structure does not reduce the target estimates to the calibration inputs by construction, nor does the abstract or described method invoke self-citations, uniqueness theorems, or ansatzes that would make the result tautological. The derivation remains externally grounded in the reference data and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen reference texts are representative and that the MLE procedure correctly recovers the mixing fraction in unseen peer-review text.

axioms (1)
  • domain assumption: Expert-written and AI-generated reference texts are sufficiently representative to calibrate the model for real peer-review text.
    The maximum likelihood procedure uses these references to estimate the unknown fraction in the target corpus.

pith-pipeline@v0.9.0 · 5814 in / 1302 out tokens · 52723 ms · 2026-05-24T02:46:23.583773+00:00 · methodology

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentReview: Exploring Peer Review Dynamics with LLM Agents

    cs.CL 2024-06 unverdicted novelty 8.0

    AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.

  2. PeerPrism: Peer Evaluation Expertise vs Review-writing AI

    cs.CL 2026-04 unverdicted novelty 7.0

    PeerPrism benchmark demonstrates that state-of-the-art LLM detectors conflate surface text style with intellectual contribution and fail on hybrid human-AI peer reviews.

  3. The Impact of AI-Generated Text on the Internet

    cs.CY 2026-04 unverdicted novelty 7.0

    By mid-2025 roughly 35% of new websites are AI-generated or AI-assisted, correlating with lower semantic diversity and higher positive sentiment but showing no significant drop in factual accuracy or stylistic diversity.

  4. Detecting Verbatim LLM Copy-Paste in Homework

    cs.CR 2026-05 unverdicted novelty 6.0

    SteganoPrompt embeds a hidden instruction in assignment prompts via the Unicode Tags block so that LLMs add a detectable signature to responses when the prompt is pasted verbatim.

  5. Rethinking Publication: A Certification Framework for AI-Enabled Research

    cs.AI 2026-04 conditional novelty 6.0

    The paper introduces a certification framework that grades AI research contributions into Categories A, B, and C based on pipeline reach at submission time and adds benchmark slots for fully automated work.

  6. Rethinking Publication: A Certification Framework for AI-Enabled Research

    cs.AI 2026-04 unverdicted novelty 6.0

    A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.

  7. AI Disclosure with DAISY

    cs.HC 2026-04 conditional novelty 6.0

    DAISY is a structured form tool that generates more complete AI disclosure statements for research papers without reducing author comfort levels.

  8. Publish and Perish: How AI-Accelerated Writing Without Proportional Verification Investment Degrades Scientific Knowledge

    physics.soc-ph 2026-04 unverdicted novelty 5.0

    A minimal ODE model of AI adoption in writing and reviewing predicts a short-term knowledge peak followed by 40% long-term decline unless review acceleration exceeds writing acceleration.

  9. Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

    cs.CY 2025-09 unverdicted novelty 5.0

    Controlled prompt interventions reveal strong affiliation bias in LLM peer reviews favoring top-ranked institutions, plus effects from seniority and publication history.
