Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling
Pith reviewed 2026-05-07 16:00 UTC · model grok-4.3
The pith
Machine-generated text exhibits a distinct dispersion in perplexity after random shuffling, unlike the stable variability of human text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Luminol-AIDetect shows that randomized text shuffling produces a characteristic dispersion in perplexity for machine-generated text that differs markedly from the more stable structural variability of human-written text. The resulting perplexity shift supplies model-agnostic scalar features that, when fed to density estimation and ensemble prediction, yield zero-shot detection decisions.
What carries the argument
Randomized text shuffling that disrupts local coherence and measures the resulting change in perplexity to expose autoregressive structural fragility.
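The probe can be sketched in a few lines; here `logprob` stands in for any scoring language model (the paper does not tie the signal to a specific scorer, and the toy interface below is an assumption for illustration):

```python
import math
import random

def perplexity(tokens, logprob):
    """Perplexity under a left-to-right LM; logprob(prefix, tok) -> log P(tok | prefix)."""
    nll = -sum(logprob(tokens[:i], tokens[i]) for i in range(1, len(tokens)))
    return math.exp(nll / (len(tokens) - 1))

def perplexity_shift_under_shuffling(tokens, logprob, n_shuffles=10, seed=0):
    """Base perplexity plus the shifts observed after random shuffles of the tokens."""
    rng = random.Random(seed)
    base = perplexity(tokens, logprob)
    shifts = []
    for _ in range(n_shuffles):
        shuffled = list(tokens)
        rng.shuffle(shuffled)  # destroy local coherence, keep the token multiset
        shifts.append(perplexity(shuffled, logprob) - base)
    return base, shifts
```

The claimed discriminant is the dispersion of these shifts, not just their mean: machine text scatters more widely under shuffling than human text.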
If this is right
- Detection requires no training data from specific generators and runs with only a handful of perplexity computations.
- Performance holds across eight content domains, eleven adversarial attack types, and eighteen languages.
- False-positive rates drop by up to a factor of seventeen relative to prior zero-shot detectors.
- Computational cost remains lower than earlier statistical or learning-based alternatives.
Where Pith is reading between the lines
- The same shuffling signal could be tested as an auxiliary feature inside existing detectors that already use token statistics.
- If future autoregressive models close the observed fragility gap, the method would lose discriminative power and require replacement signals.
- The approach suggests a practical way to audit new generators for the same structural property without needing white-box access.
Load-bearing premise
The autoregressive generation process creates a structural fragility that random shuffling reliably exposes and that remains distinguishable from the variability of human text in every domain and language.
What would settle it
A large test set in which the distribution of perplexity shifts after shuffling is statistically identical for machine-generated and human-written texts would falsify the claimed discriminant.
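As an illustration (not taken from the paper), such a falsification test would compare the two empirical shift distributions with a two-sample statistic; a minimal Kolmogorov-Smirnov sketch:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of samples a and b (0 = identical samples, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap
```

A statistic near zero (with a non-significant p-value) on a large, diverse corpus of shift values would undercut the claimed discriminant.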
Original abstract
Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on model-specific fingerprints. In this respect, we hypothesize that while large language models excel at local semantic consistency, their autoregressive nature results in a specific kind of structural fragility compared to human writing. We propose Luminol-AIDetect, a novel, zero-shot statistical approach that exposes this fragility through coherence disruption. By applying a simple randomized text-shuffling procedure, we demonstrate that the resulting shift in perplexity serves as a principled, model-agnostic discriminant, as MGT displays a characteristic dispersion in perplexity-under-shuffling that differs markedly from the more stable structural variability of human-written text. Luminol-AIDetect leverages this distinction to inform its decision process, where a handful of perplexity-based scalar features are extracted from an input text and its shuffled version, then detection is performed via density estimation and ensemble-based prediction. Evaluated across 8 content domains, 11 adversarial attack types, and 18 languages, Luminol-AIDetect demonstrates state-of-the-art performance, with gains up to 17x lower FPR while being cheaper than prior methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Luminol-AIDetect, a zero-shot statistical detector for machine-generated text (MGT) that exploits differences in perplexity dispersion under randomized text shuffling. It hypothesizes that autoregressive LLMs produce text with characteristic structural fragility, leading to greater variability in perplexity after shuffling compared to the more stable variability in human text. A small set of scalar features derived from the perplexity of the original and shuffled versions of an input text are fed to density estimation and ensemble-based prediction for classification. The method is evaluated across 8 content domains, 11 adversarial attack types, and 18 languages, with claims of state-of-the-art performance including up to 17x lower false-positive rates and lower computational cost than prior approaches.
Significance. If the empirical results hold under scrutiny, the work provides a fast, model-agnostic, zero-shot alternative to existing detectors that could be practically useful when the generating model is unknown or training data is unavailable. The shuffling-based probe offers a simple, interpretable way to surface autoregressive coherence differences, and the broad multi-domain, multi-attack, and multilingual evaluation supports the claim of generalizability. The approach adds to the set of structural signals for MGT detection without requiring learned parameters or model access.
major comments (2)
- [§3] §3 (Method description): the manuscript refers to extracting 'a handful of perplexity-based scalar features' from the original and shuffled text but does not enumerate or define these features (e.g., mean/variance of perplexity shift, specific quantiles, or other statistics), nor does it specify the exact density estimation procedure or ensemble construction; these omissions are load-bearing for reproducing the central discriminant and verifying that the reported performance stems from the hypothesized shuffling effect rather than implementation details.
- [§4] §4 (Experimental results): the claim of 'state-of-the-art performance' with 'gains up to 17x lower FPR' is presented without tabulated baseline FPR values, confidence intervals, or statistical significance tests for the key comparisons; this makes it impossible to assess whether the improvement is robust or driven by particular baselines, which is central to the paper's empirical contribution.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief equation or pseudocode snippet illustrating how the perplexity shift is quantified, to make the core idea immediately accessible.
- [§4] The evaluation section should explicitly list the 18 languages and the 11 attack types (or cite a table) so readers can judge coverage without external lookup.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The comments highlight opportunities to enhance reproducibility and empirical rigor, which we address below.
Point-by-point responses
Referee: [§3] §3 (Method description): the manuscript refers to extracting 'a handful of perplexity-based scalar features' from the original and shuffled text but does not enumerate or define these features (e.g., mean/variance of perplexity shift, specific quantiles, or other statistics), nor does it specify the exact density estimation procedure or ensemble construction; these omissions are load-bearing for reproducing the central discriminant and verifying that the reported performance stems from the hypothesized shuffling effect rather than implementation details.
Authors: We agree that the method section requires greater specificity for reproducibility. In the revised manuscript, we will enumerate the exact scalar features (mean and variance of the perplexity shift, 25th/50th/75th quantiles of the shuffled perplexities, and the ratio of original to mean-shuffled perplexity) and fully specify the density estimation (kernel density estimation with Gaussian kernel and Scott's rule bandwidth) along with the ensemble procedure (five independent density models combined by averaging posterior probabilities). These additions will confirm that the discriminant relies on the shuffling-induced dispersion rather than ancillary implementation choices. revision: yes
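A minimal sketch of the pipeline this response describes, using only the standard library. The feature list and the KDE settings come from the response above; everything else (function names, the crude quantile rule, the stand-in data) is illustrative, not the authors' implementation:

```python
import math
import statistics

def shift_features(base_ppl, shuffled_ppls):
    """Scalar features from an input's perplexity and its shuffled versions':
    mean/variance of the shift, 25th/50th/75th shift quantiles, and the
    ratio of original to mean-shuffled perplexity."""
    shifts = sorted(p - base_ppl for p in shuffled_ppls)

    def q(frac):  # crude quantile: nearest lower order statistic
        return shifts[int(frac * (len(shifts) - 1))]

    return [
        statistics.mean(shifts),
        statistics.variance(shifts),
        q(0.25), q(0.50), q(0.75),
        base_ppl / statistics.mean(shuffled_ppls),
    ]

def scott_bandwidth(samples):
    """Scott's rule for 1-D KDE: h = sigma * n^(-1/5)."""
    return statistics.stdev(samples) * len(samples) ** (-0.2)

def kde_logpdf(x, samples, bandwidth):
    """Log-density of a 1-D Gaussian kernel density estimate at x."""
    s = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in samples)
    return math.log(s / (len(samples) * bandwidth * math.sqrt(2 * math.pi)))
```

Per the response, five such density models would be trained independently and their posterior probabilities averaged for the final decision.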
Referee: [§4] §4 (Experimental results): the claim of 'state-of-the-art performance' with 'gains up to 17x lower FPR' is presented without tabulated baseline FPR values, confidence intervals, or statistical significance tests for the key comparisons; this makes it impossible to assess whether the improvement is robust or driven by particular baselines, which is central to the paper's empirical contribution.
Authors: The current results section reports aggregate metrics and selected comparisons but lacks the requested tabular detail. In revision we will add a table of per-domain and per-attack FPR values for Luminol-AIDetect and all baselines, include 95% bootstrap confidence intervals on the reported FPR reductions, and apply paired statistical tests (McNemar for classification outcomes) to establish significance of the up to 17x gains. This will allow direct evaluation of robustness across the 8 domains and 11 attacks. revision: yes
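The bootstrap interval promised above can be computed as in this sketch (a percentile bootstrap on the mean; neither the paper nor the response specifies the exact variant, so this choice is an assumption):

```python
import random
import statistics

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`:
    resample with replacement, collect the resampled means, and return
    the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi
```

Applied per domain and per attack to the FPR reductions, this yields the 95% intervals the revision commits to; McNemar's test would then be run on the paired per-example decisions.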
Circularity Check
No significant circularity in derivation chain
full rationale
The paper advances a hypothesis that randomized shuffling exposes autoregressive fragility in MGT via measurable perplexity dispersion, distinct from human text stability. This is implemented by extracting scalar features from original and shuffled text, followed by density estimation for classification. The approach relies on external perplexity computation and standard statistical techniques, with empirical validation across domains, attacks, and languages. No equations or steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central discriminant is an independent empirical claim rather than a renaming or tautology.