pith. sign in

arxiv: 2605.21713 · v1 · pith:56X75SUAnew · submitted 2026-05-20 · 💻 cs.CL

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

Pith reviewed 2026-05-22 08:56 UTC · model grok-4.3

classification 💻 cs.CL
keywords AI generated text detectionpeer review analysissemantic claim comparisonauthorship attributionLLM refinementconference reviewsICLR NeurIPS datasetbinary and multiclass classification
0
0 comments X

The pith

Sem-Detect identifies AI-generated peer reviews by comparing their semantic claims to those produced by multiple AI models on the same paper.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sem-Detect to determine whether a peer review was written by a human or generated by AI. It moves beyond surface-level text features to examine the specific ideas, judgments, and claims expressed in the review. The method generates several AI reviews for the identical paper and measures overlap in semantic points with the target review. Human reviews tend to diverge because they contain more unique perspectives, whereas AI outputs cluster around similar content. This distinction supports efforts to protect the quality of scientific evaluation as AI tools become more common in writing assistance.

Core claim

Sem-Detect operationalizes authorship detection for peer reviews by combining textual features with claim-level semantic analysis. It compares a target review against multiple AI-generated reviews of the same paper and exploits the observation that different AI models converge on similar points while human reviewers introduce more unique and diverse ones. This enables reliable separation of fully AI-generated reviews from authentic human-written ones, including cases where humans refine their text with an LLM, as shown across more than 20,000 reviews from ICLR and NeurIPS.

What carries the argument

Claim-level semantic comparison of the target review against multiple AI-generated reviews of the same paper, which surfaces convergence patterns in AI outputs versus diversity in human ones.

If this is right

  • In the binary setting the method raises true-positive rate at 0.1 percent false-positive rate by 25.5 percent over the strongest baseline.
  • In the three-class setting fewer than 3.5 percent of LLM-refined human reviews are misclassified as fully AI-generated.
  • LLM refinement leaves the underlying semantic signals of human judgment intact and distinct from pure AI patterns.
  • The approach scales to large collections of conference reviews drawn from ICLR and NeurIPS.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same divergence signal could be tested on AI-assisted writing in domains such as grant proposals or news articles.
  • Review platforms might incorporate this check to route only low-divergence reviews for extra human scrutiny.
  • Future AI systems could be prompted or fine-tuned to increase semantic spread deliberately in order to reduce detectability.
  • The technique suggests that measuring agreement with a cluster of synthetic outputs is a general way to surface human originality.

Load-bearing premise

Different AI models will keep converging on similar points and claims in reviews, while human reviewers will keep introducing more unique and diverse ones.

What would settle it

If a new generation of AI models produces reviews whose semantic claims show the same spread and uniqueness as human reviews, the reported gains in distinguishing fully AI text would no longer hold.

Figures

Figures reproduced from arXiv: 2605.21713 by Aditya Oke, Andr\'e V. Duarte, Arlindo L. Oliveira, Brian Tufts, Fei Fang, Lei Li.

Figure 1
Figure 1. Figure 1: Classical AI-text detectors rely on textual features to decide whether a review was written by a human. Sem-Detect instead infers authorship by leveraging the semantic content of expressed ideas, thereby distinguishing fully AI-generated reviews from LLM-refined human ones. large language models (LLMs), there is growing evidence of AI-generated content appearing in peer reviews (Liang et al., 2024; Zhou et… view at source ↗
Figure 2
Figure 2. Figure 2: Sem-Detect pipeline. We construct our dataset by prompting LLMs to generate fully AI reviews from conference papers and to refine authentic human reviews, creating three classes. For classification, each target review (from any class) is paired with multiple AI-generated reference reviews of the same paper. We extract textual features from the target review and semantic features from the target-reference c… view at source ↗
Figure 4
Figure 4. Figure 4: ROC curves for the collapsed binary task. Human and LLM-Refined reviews are grouped against fully AI reviews. As shown in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Mean best-match claim similarity by class (test set). 5.3. Deployment via Confidence Thresholding By default, Sem-Detect predicts the highest-probability class regardless of certainty. For example, probabilities of 0.51 AI-generated, 0.48 human-written, and 0.01 LLM￾refined still yield an AI-generated label, despite near-tie uncertainty between human and AI, which is undesirable when false accusations are … view at source ↗
Figure 7
Figure 7. Figure 7: Prediction confidence calibration by predicted class. Fortunately, as [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Effect of confidence thresholding on classification accuracy, coverage, and Human→LLM-refined error rate. 5.4. Robustness to Generation Conditions In practice, reviewers may use diverse models and prompts to generate or refine reviews, raising the question of whether Sem-Detect generalizes beyond its training conditions. We evaluate two out-of-distribution settings: (i) OOD-M, where reviews are generated b… view at source ↗
Figure 9
Figure 9. Figure 9: Cross-domain generalization results. F1 scores for Sem￾Detect on the ML test set and the medical imaging venue MIDL 2022. No domain-specific retraining is performed. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Sem-Detect and EditLens (Pangram Labs) review au￾thorship predictions on ICLR 2026 data. 5.6. ICLR 2026 Comparison Our evaluations so far have relied on data where ground truth labels are known due to temporal constraints. To examine how Sem-Detect behaves in a contemporary setting, we turn to ICLR 2026, sampling approximately 600 papers at random. This analysis is motivated by recent claims from Pangram … view at source ↗
Figure 11
Figure 11. Figure 11: Distribution of claim types across venues and years. A.2. Generating Fully AI-Reviews [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: summarizes the computational costs for review generation and cleaning. Whenever possible, we use batch API calls to reduce latency and cost. In our case, Gemini models are queried with Gemini Batch API requests2 , while DeepSeek and Qwen-3 use synchronous requests through AWS3 . The generation step (Figure 12a) represents the largest expense, with total costs approaching $170. These costs, however, are di… view at source ↗
Figure 13
Figure 13. Figure 13: Computational cost for claim extraction across review classes, broken down by venue and year [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Comparison of classifier performance across hy￾perparameter configurations. Each boxplot shows the dis￾tribution of macro-F1 scores on the test set obtained during randomized search [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Effect of the number of reference reviews (k) on three-class detection performance. From [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Analysis of embedding model choice on the three-class detection performance. B.3.1. SCALING EMBEDDING MODEL SIZE We evaluate three variants of Qwen-3 Embedding (Zhang et al., 2025) at different scales: 0.6B, 4B, and 8B parameters. Figure 16a presents the results. While performance improves as model size increases, the benefits plateau at the 4B scale, as the 8B model performs on par with the 4B variant. D… view at source ↗
Figure 17
Figure 17. Figure 17: F1 score comparison between claim-level and sentence-level segmentation strategies [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Relative importance of the nine features as learned by the LightGBM classifier. 1. Proportion of High-Similarity: Captures what proportion of a review is highly aligned with the AI-generated references. For each target claim, we check whether its maximum semantic similarity to any claim in the AI-generated references exceeds a threshold τ , and report the fraction of claims that do so. We tune τ via a lin… view at source ↗
Figure 19
Figure 19. Figure 19: Distribution of semantic and surface-level features across review types. A clear pattern emerges from these distributions. Semantic features primarily separate fully AI-generated reviews from the other two classes, with LLM-refined reviews remaining close to human-written ones. This supports our core hypothesis: polishing a review with an LLM preserves the original human ideas. Textual features, by contra… view at source ↗
Figure 20
Figure 20. Figure 20: presents the results. The combined approach achieves a Macro-F1 score of approximately 0.84, outperforming both the textual-only variant (0.76) and the semantic-only variant (0.59). This performance gap highlights why neither feature type alone is sufficient for reliable three-class detection. Textual-Only Semantic-Only Both (Sem-Detect) Type of Training Features 0.0 0.2 0.4 0.6 0.8 1.0 Macro-F1 Score [P… view at source ↗
Figure 21
Figure 21. Figure 21: Distribution of Macro-F1 scores across all 511 feature subsets, grouped by feature composition: textual-only, semantic-only, and both (semantic + textual). Together, these analyses provide comprehensive evidence that (i) the combination of semantic and textual features is structurally necessary and not an artifact of our particular selection, and (ii) the full feature set performs near-optimally within th… view at source ↗
Figure 22
Figure 22. Figure 22: Binary generalization under distribution shift for Sem-Detect and baselines [PITH_FULL_IMAGE:figures/full_fig_p027_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Fraction of reviews classified as AI-generated when human claims are increasingly replaced with AI-generated ones. As shown in [PITH_FULL_IMAGE:figures/full_fig_p029_23.png] view at source ↗
Figure 24
Figure 24. Figure 24 [PITH_FULL_IMAGE:figures/full_fig_p030_24.png] view at source ↗
read the original abstract

How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Sem-Detect, a method for distinguishing AI-generated peer reviews from human-written ones by combining textual features with claim-level semantic analysis. It compares a target review against multiple AI-generated reviews of the same paper, exploiting the observation that AI models converge on similar points while human reviewers produce more diverse claims and judgments. Evaluated on a dataset of over 20,000 peer reviews from ICLR and NeurIPS, Sem-Detect reports a 25.5% improvement in TPR@0.1% FPR over the strongest baseline in the binary setting and fewer than 3.5% misclassification of LLM-refined human reviews as fully AI-generated in the three-class setting.

Significance. If the results hold, this work could meaningfully advance efforts to maintain peer-review integrity in NLP and ML conferences amid rising AI assistance. The semantic-level focus, which targets idea convergence rather than surface style, represents a potentially more robust signal than purely textual detectors, particularly for the practically relevant case of human-refined AI text. The scale of the real-world dataset from top venues is a clear strength.

major comments (2)
  1. [§4] §4 (Experiments): The manuscript provides no details on dataset construction, labeling procedure, baseline implementations, or controls for selection effects. This absence directly undermines evaluation of the central empirical claim of a 25.5% TPR@0.1% FPR gain on the 20k+ review corpus.
  2. [§3] §3 (Method): The load-bearing assumption that different AI models converge on similar semantic points while humans introduce greater diversity is stated but not supported by any quantitative analysis (e.g., intra-AI vs. human semantic similarity metrics) independent of the final detection results.
minor comments (2)
  1. [Abstract] Abstract: Specify the exact size of the dataset and the precise split between ICLR and NeurIPS reviews.
  2. [§3.2] Figure 1 or §3.2: Ensure the claim-level extraction pipeline is illustrated with a concrete example of a review and its extracted claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of Sem-Detect. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The manuscript provides no details on dataset construction, labeling procedure, baseline implementations, or controls for selection effects. This absence directly undermines evaluation of the central empirical claim of a 25.5% TPR@0.1% FPR gain on the 20k+ review corpus.

    Authors: We agree that the current manuscript lacks sufficient detail on these aspects, which is necessary for reproducibility and proper evaluation of the results. In the revised version, we will substantially expand Section 4 with a dedicated subsection on dataset construction. This will describe the collection of the over 20,000 reviews from ICLR and NeurIPS (including years and access methods), the exact labeling procedure distinguishing human-written reviews from AI-generated ones (including prompts and models used for generation), full implementation details and hyperparameters for all baselines, and any controls or checks for selection effects such as review length, topic distribution, or venue-specific biases. revision: yes

  2. Referee: [§3] §3 (Method): The load-bearing assumption that different AI models converge on similar semantic points while humans introduce greater diversity is stated but not supported by any quantitative analysis (e.g., intra-AI vs. human semantic similarity metrics) independent of the final detection results.

    Authors: We acknowledge that providing quantitative support for the core assumption independent of the end-to-end detection performance would make the method more robust. While the reported improvements in TPR@0.1% FPR and the three-class results are consistent with greater semantic convergence among AI-generated reviews, we will add a new analysis subsection in §3. This will include direct measurements of semantic similarity (using sentence-BERT embeddings and claim extraction) across multiple AI-generated versions of the same paper versus human reviews, reporting intra-AI vs. human similarity statistics to validate the assumption separately from the classifier. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core method defines Sem-Detect as a comparison of a target review to AI-generated reviews of the same paper, using the stated empirical observation that AI outputs converge on similar claims while human reviews are more diverse. This observation is presented as input to the detector rather than derived from it; performance is measured on an external 20k+ review dataset from ICLR/NeurIPS with no equations or parameters shown to be fitted and then renamed as predictions. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are merely renamed. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that AI models converge semantically on the same paper while humans diverge; no free parameters, invented entities, or additional axioms are described in the abstract.

axioms (1)
  • domain assumption Different AI models tend to converge on similar points for the same paper while human reviewers introduce more unique and diverse ones.
    This observation is presented as the foundational principle enabling the semantic comparison in Sem-Detect.

pith-pipeline@v0.9.0 · 5790 in / 1299 out tokens · 32085 ms · 2026-05-22T08:56:51.757010+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

  1. [1]

    doi: 10.18653/v1/2024.findings-emnlp.377

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.377. Emi, B. Pangram Predicts 21% of ICLR Reviews are AI- Generated. https://www.pangram.com/blog /pangram-predicts-21-of-iclr-reviews -are-ai-generated, 2025. Accessed: 2025-12-01. Fitzgibbon, A., Leal-Taix ´e, L., and Murino, V . Opening ceremony slides at the European Confe...

  2. [2]

    doi: 10.18653/v1/2020.acl-main.164

    Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.164. Jawahar, G., Abdul-Mageed, M., and Lakshmanan, V .S., L. Automatic detection of machine generated text: A critical survey. In Scott, D., Bel, N., and Zong, C. (eds.),Pro- ceedings of the 28th International Conference on Compu- tational Linguistics, pp. 2296–2309, 2020. Jiang, A...

  3. [3]

    DeepSeek-V3 Technical Report

    URL https://aclanthology.org/2024. emnlp-main.1262/. Lavergne, T., Urvoy, T., and Yvon, F. Detecting fake con- tent with relative entropy scoring. InProceedings of the 2008 International Conference on Uncovering Plagia- rism, Authorship and Social Software Misuse - Volume 377, PAN’08, pp. 27–31, Aachen, DEU, 2008. CEUR- WS.org. Li, Y ., Li, Q., Cui, L., B...

  4. [4]

    OpenAI GPT-5 System Card

    doi: 10.1145/3757667. URL https://doi.or g/10.1145/3757667. Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert- V oss, A., Wu, J., Radford, A., Krueger, G....

  5. [5]

    Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y ., Farhadi, A., Roesner, F., and Choi, Y

    URL https://openreview.net/forum ?id=HyZwf1rt4s. Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y ., Farhadi, A., Roesner, F., and Choi, Y . Defending against neural fake news. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alch´e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Proc. of NeurIPS, pp. 9051–9062, 2019. Zhang, Y ., Li, M., Long, D., Zhang,...

  6. [6]

    Factual Restatement– Summaries or descriptions of what the paper does, its methods, contributions, or results

  7. [7]

    3.Constructive Input– Actionable suggestions or recommendations for improvement

    Evaluation– Judgments of quality, including both strengths (positive evaluations) and weaknesses/limitations (negative evaluations). 3.Constructive Input– Actionable suggestions or recommendations for improvement. 4.Clarification Dialogue– Questions directed to the authors or requests for clarification

  8. [8]

    important findings

    Meta-Commentary– Remarks about the broader context, such as fit for the venue, clarity of writing, novelty, or overall recommendation. 14 Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews On Tables 7 and 8 we now illustrate the claim extraction process with a real example: the full original review text is shown first, followed by its extra...

  9. [9]

    Proportion of High-Similarity: Captures what proportion of a review is highly aligned with the AI-generated references. For each target claim, we check whether its maximum semantic similarity to any claim in the AI-generated references exceeds a threshold τ, and report the fraction of claims that do so. We tune τ via a linear sweep from 0.7 to 0.9 during ...

  10. [10]

    For all target claims whose maximum semantic similarity to any AI claim exceedsτ, we compute the mean of these maximum similarity values

    Mean Similarities Above Threshold: For the subset of claims previously identified as having strong overlap with the AI-generated references, this feature captures how strong that overlap is on average. For all target claims whose maximum semantic similarity to any AI claim exceedsτ, we compute the mean of these maximum similarity values

  11. [11]

    Mean Best-Match Claim Similarity: Captures the overall semantic proximity of a review to AI-generated content. For each target claim, we compute its best-match semantic similarity to any claim in the AI-generated reference reviews, and then average these best-match similarities across all target claims

  12. [12]

    Intra-Review Semantic Diversity: Captures how semantically varied the claims within a review are. We compute all pairwise cosine similarities between claim embeddings within the target review and define the feature as one minus their mean, so that higher values correspond to greater semantic diversity and lower redundancy

  13. [13]

    We compute the natural logarithm of one plus the number of claims extracted from the target review

    Log Review Length: Captures the effective length of a review while reducing the influence of very long outliers. We compute the natural logarithm of one plus the number of claims extracted from the target review

  14. [14]

    We average the entropy of the model’s next-token distribution over all positions in the text

    Entropy: Captures uncertainty in the language model’s next-token predictions along the review. We average the entropy of the model’s next-token distribution over all positions in the text

  15. [15]

    Perplexity: Captures how predictable the review text is under a given language model. Although entropy and perplexity are closely related, we include both features as they capture complementary aspects of the model’s behavior, and we find that including both consistently improves classification performance in practice

  16. [16]

    We compute the fraction of tokens in the target review whose next-token probability lies within the model’s top-kpredictions, usingk= 200

    Top-k Token Percentage: Captures how often the review follows highly probable token choices under a language model. We compute the fraction of tokens in the target review whose next-token probability lies within the model’s top-kpredictions, usingk= 200

  17. [17]

    As an additional textual feature, we include the score produced by Fast-DetectGPT when applied to the target review

    Fast-Detect Score: Captures token-level statistical signals associated with machine-generated text. As an additional textual feature, we include the score produced by Fast-DetectGPT when applied to the target review. 21 Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews While Figure 18 reveals which features the classifier relies on most, i...

  18. [18]

    Reviews that feature false claims are a code of ethics violation

    and Kimi-K2.5 (Kimi-Team, 2026). Table 16.Effect of expanding the training generator pool on the OOD-M+P test set. Variant 1 uses all six families per paper. Variant 2 samples four of the six families per paper, keeping the original class distribution while exposing the classifier to all six families overall. Configuration Macro-F1 AI Prec. AI Rec. Origin...

  19. [19]

    the document is well-written

    Not all claims are verifiable. Generic statements like “the document is well-written” cannot be checked against the paper content. We therefore fine-tune a BERT classifier to filter out such claims (≈25% claims are discarded)

  20. [20]

    The method has not been evaluated on open-source LLMs

    We target hallucinations rather than subjective assessment errors. Human reviewers may legitimately misinterpret aspects of a paper. Hallucinations, by contrast, are clear false statements that directly contradict the paper, for example, claiming “The method has not been evaluated on open-source LLMs” when there are experiments clearly reporting them. Our...

  21. [21]

    The full text of an academic paper

  22. [22]

    The full peer review

  23. [23]

    Your task is to determine whether the reviewer has made aHALLUCINATED claimabout the paper

    One claim extracted from the peer review. Your task is to determine whether the reviewer has made aHALLUCINATED claimabout the paper. Core Distinction Youmust distinguishbetween: Hallucination:The reviewer asserts false contentas if it were explicitly stated in the paper. Possible Incorrect Claim:The reviewer may be wrong due to interpretation, reasoning,...