Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews
Pith reviewed 2026-05-22 08:56 UTC · model grok-4.3
The pith
Sem-Detect identifies AI-generated peer reviews by comparing their semantic claims to those produced by multiple AI models on the same paper.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sem-Detect operationalizes authorship detection for peer reviews by combining textual features with claim-level semantic analysis. It compares a target review against multiple AI-generated reviews of the same paper and exploits the observation that different AI models converge on similar points while human reviewers introduce more unique and diverse ones. This enables reliable separation of fully AI-generated reviews from authentic human-written ones, including cases where humans refine their text with an LLM, as shown across more than 20,000 reviews from ICLR and NeurIPS.
What carries the argument
Claim-level semantic comparison of the target review against multiple AI-generated reviews of the same paper, which surfaces convergence patterns in AI outputs versus diversity in human ones.
If this is right
- In the binary setting the method raises true-positive rate at 0.1 percent false-positive rate by 25.5 percent over the strongest baseline.
- In the three-class setting fewer than 3.5 percent of LLM-refined human reviews are misclassified as fully AI-generated.
- LLM refinement leaves the underlying semantic signals of human judgment intact and distinct from pure AI patterns.
- The approach scales to large collections of conference reviews drawn from ICLR and NeurIPS.
Where Pith is reading between the lines
- The same divergence signal could be tested on AI-assisted writing in domains such as grant proposals or news articles.
- Review platforms might incorporate this check to route only low-divergence reviews for extra human scrutiny.
- Future AI systems could be prompted or fine-tuned to increase semantic spread deliberately in order to reduce detectability.
- The technique suggests that measuring agreement with a cluster of synthetic outputs is a general way to surface human originality.
Load-bearing premise
Different AI models will keep converging on similar points and claims in reviews, while human reviewers will keep introducing more unique and diverse ones.
What would settle it
If a new generation of AI models produces reviews whose semantic claims show the same spread and uniqueness as human reviews, the reported gains in distinguishing fully AI text would no longer hold.
Figures
read the original abstract
How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sem-Detect, a method for distinguishing AI-generated peer reviews from human-written ones by combining textual features with claim-level semantic analysis. It compares a target review against multiple AI-generated reviews of the same paper, exploiting the observation that AI models converge on similar points while human reviewers produce more diverse claims and judgments. Evaluated on a dataset of over 20,000 peer reviews from ICLR and NeurIPS, Sem-Detect reports a 25.5% improvement in TPR@0.1% FPR over the strongest baseline in the binary setting and fewer than 3.5% misclassification of LLM-refined human reviews as fully AI-generated in the three-class setting.
Significance. If the results hold, this work could meaningfully advance efforts to maintain peer-review integrity in NLP and ML conferences amid rising AI assistance. The semantic-level focus, which targets idea convergence rather than surface style, represents a potentially more robust signal than purely textual detectors, particularly for the practically relevant case of human-refined AI text. The scale of the real-world dataset from top venues is a clear strength.
major comments (2)
- [§4] §4 (Experiments): The manuscript provides no details on dataset construction, labeling procedure, baseline implementations, or controls for selection effects. This absence directly undermines evaluation of the central empirical claim of a 25.5% TPR@0.1% FPR gain on the 20k+ review corpus.
- [§3] §3 (Method): The load-bearing assumption that different AI models converge on similar semantic points while humans introduce greater diversity is stated but not supported by any quantitative analysis (e.g., intra-AI vs. human semantic similarity metrics) independent of the final detection results.
minor comments (2)
- [Abstract] Abstract: Specify the exact size of the dataset and the precise split between ICLR and NeurIPS reviews.
- [§3.2] Figure 1 or §3.2: Ensure the claim-level extraction pipeline is illustrated with a concrete example of a review and its extracted claims.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential significance of Sem-Detect. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): The manuscript provides no details on dataset construction, labeling procedure, baseline implementations, or controls for selection effects. This absence directly undermines evaluation of the central empirical claim of a 25.5% TPR@0.1% FPR gain on the 20k+ review corpus.
Authors: We agree that the current manuscript lacks sufficient detail on these aspects, which is necessary for reproducibility and proper evaluation of the results. In the revised version, we will substantially expand Section 4 with a dedicated subsection on dataset construction. This will describe the collection of the over 20,000 reviews from ICLR and NeurIPS (including years and access methods), the exact labeling procedure distinguishing human-written reviews from AI-generated ones (including prompts and models used for generation), full implementation details and hyperparameters for all baselines, and any controls or checks for selection effects such as review length, topic distribution, or venue-specific biases. revision: yes
-
Referee: [§3] §3 (Method): The load-bearing assumption that different AI models converge on similar semantic points while humans introduce greater diversity is stated but not supported by any quantitative analysis (e.g., intra-AI vs. human semantic similarity metrics) independent of the final detection results.
Authors: We acknowledge that providing quantitative support for the core assumption independent of the end-to-end detection performance would make the method more robust. While the reported improvements in TPR@0.1% FPR and the three-class results are consistent with greater semantic convergence among AI-generated reviews, we will add a new analysis subsection in §3. This will include direct measurements of semantic similarity (using sentence-BERT embeddings and claim extraction) across multiple AI-generated versions of the same paper versus human reviews, reporting intra-AI vs. human similarity statistics to validate the assumption separately from the classifier. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's core method defines Sem-Detect as a comparison of a target review to AI-generated reviews of the same paper, using the stated empirical observation that AI outputs converge on similar claims while human reviews are more diverse. This observation is presented as input to the detector rather than derived from it; performance is measured on an external 20k+ review dataset from ICLR/NeurIPS with no equations or parameters shown to be fitted and then renamed as predictions. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are merely renamed. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Different AI models tend to converge on similar points for the same paper while human reviewers introduce more unique and diverse ones.
Reference graph
Works this paper leans on
-
[1]
doi: 10.18653/v1/2024.findings-emnlp.377
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.377. Emi, B. Pangram Predicts 21% of ICLR Reviews are AI- Generated. https://www.pangram.com/blog /pangram-predicts-21-of-iclr-reviews -are-ai-generated, 2025. Accessed: 2025-12-01. Fitzgibbon, A., Leal-Taix ´e, L., and Murino, V . Opening ceremony slides at the European Confe...
-
[2]
doi: 10.18653/v1/2020.acl-main.164
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.164. Jawahar, G., Abdul-Mageed, M., and Lakshmanan, V .S., L. Automatic detection of machine generated text: A critical survey. In Scott, D., Bel, N., and Zong, C. (eds.),Pro- ceedings of the 28th International Conference on Compu- tational Linguistics, pp. 2296–2309, 2020. Jiang, A...
-
[3]
URL https://aclanthology.org/2024. emnlp-main.1262/. Lavergne, T., Urvoy, T., and Yvon, F. Detecting fake con- tent with relative entropy scoring. InProceedings of the 2008 International Conference on Uncovering Plagia- rism, Authorship and Social Software Misuse - Volume 377, PAN’08, pp. 27–31, Aachen, DEU, 2008. CEUR- WS.org. Li, Y ., Li, Q., Cui, L., B...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
doi: 10.1145/3757667. URL https://doi.or g/10.1145/3757667. Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert- V oss, A., Wu, J., Radford, A., Krueger, G....
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3757667 2025
-
[5]
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y ., Farhadi, A., Roesner, F., and Choi, Y
URL https://openreview.net/forum ?id=HyZwf1rt4s. Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y ., Farhadi, A., Roesner, F., and Choi, Y . Defending against neural fake news. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alch´e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Proc. of NeurIPS, pp. 9051–9062, 2019. Zhang, Y ., Li, M., Long, D., Zhang,...
-
[6]
Factual Restatement– Summaries or descriptions of what the paper does, its methods, contributions, or results
-
[7]
3.Constructive Input– Actionable suggestions or recommendations for improvement
Evaluation– Judgments of quality, including both strengths (positive evaluations) and weaknesses/limitations (negative evaluations). 3.Constructive Input– Actionable suggestions or recommendations for improvement. 4.Clarification Dialogue– Questions directed to the authors or requests for clarification
-
[8]
Meta-Commentary– Remarks about the broader context, such as fit for the venue, clarity of writing, novelty, or overall recommendation. 14 Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews On Tables 7 and 8 we now illustrate the claim extraction process with a real example: the full original review text is shown first, followed by its extra...
work page 2021
-
[9]
Proportion of High-Similarity: Captures what proportion of a review is highly aligned with the AI-generated references. For each target claim, we check whether its maximum semantic similarity to any claim in the AI-generated references exceeds a threshold τ, and report the fraction of claims that do so. We tune τ via a linear sweep from 0.7 to 0.9 during ...
-
[10]
Mean Similarities Above Threshold: For the subset of claims previously identified as having strong overlap with the AI-generated references, this feature captures how strong that overlap is on average. For all target claims whose maximum semantic similarity to any AI claim exceedsτ, we compute the mean of these maximum similarity values
-
[11]
Mean Best-Match Claim Similarity: Captures the overall semantic proximity of a review to AI-generated content. For each target claim, we compute its best-match semantic similarity to any claim in the AI-generated reference reviews, and then average these best-match similarities across all target claims
-
[12]
Intra-Review Semantic Diversity: Captures how semantically varied the claims within a review are. We compute all pairwise cosine similarities between claim embeddings within the target review and define the feature as one minus their mean, so that higher values correspond to greater semantic diversity and lower redundancy
-
[13]
We compute the natural logarithm of one plus the number of claims extracted from the target review
Log Review Length: Captures the effective length of a review while reducing the influence of very long outliers. We compute the natural logarithm of one plus the number of claims extracted from the target review
-
[14]
We average the entropy of the model’s next-token distribution over all positions in the text
Entropy: Captures uncertainty in the language model’s next-token predictions along the review. We average the entropy of the model’s next-token distribution over all positions in the text
-
[15]
Perplexity: Captures how predictable the review text is under a given language model. Although entropy and perplexity are closely related, we include both features as they capture complementary aspects of the model’s behavior, and we find that including both consistently improves classification performance in practice
-
[16]
Top-k Token Percentage: Captures how often the review follows highly probable token choices under a language model. We compute the fraction of tokens in the target review whose next-token probability lies within the model’s top-kpredictions, usingk= 200
-
[17]
Fast-Detect Score: Captures token-level statistical signals associated with machine-generated text. As an additional textual feature, we include the score produced by Fast-DetectGPT when applied to the target review. 21 Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews While Figure 18 reveals which features the classifier relies on most, i...
work page 2021
-
[18]
Reviews that feature false claims are a code of ethics violation
and Kimi-K2.5 (Kimi-Team, 2026). Table 16.Effect of expanding the training generator pool on the OOD-M+P test set. Variant 1 uses all six families per paper. Variant 2 samples four of the six families per paper, keeping the original class distribution while exposing the classifier to all six families overall. Configuration Macro-F1 AI Prec. AI Rec. Origin...
work page 2026
-
[19]
Not all claims are verifiable. Generic statements like “the document is well-written” cannot be checked against the paper content. We therefore fine-tune a BERT classifier to filter out such claims (≈25% claims are discarded)
-
[20]
The method has not been evaluated on open-source LLMs
We target hallucinations rather than subjective assessment errors. Human reviewers may legitimately misinterpret aspects of a paper. Hallucinations, by contrast, are clear false statements that directly contradict the paper, for example, claiming “The method has not been evaluated on open-source LLMs” when there are experiments clearly reporting them. Our...
work page 2025
-
[21]
The full text of an academic paper
-
[22]
The full peer review
-
[23]
Your task is to determine whether the reviewer has made aHALLUCINATED claimabout the paper
One claim extracted from the peer review. Your task is to determine whether the reviewer has made aHALLUCINATED claimabout the paper. Core Distinction Youmust distinguishbetween: Hallucination:The reviewer asserts false contentas if it were explicitly stated in the paper. Possible Incorrect Claim:The reviewer may be wrong due to interpretation, reasoning,...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.