Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text
Pith reviewed 2026-05-08 10:27 UTC · model grok-4.3
The pith
Naively averaging token likelihood scores triggers Simpson's paradox in machine-text detectors by destroying non-uniform local signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, a learned local calibration step grounded in Bayesian decision theory is introduced. Rather than aggregating raw token scores, lightweight predictors of the score distributions conditioned on position in hidden space are learned first, and calibrated log-likelihood ratios are aggregated instead. This single intervention dramatically, e
What carries the argument
Learned lightweight predictors of score distributions conditioned on hidden-space position, used to replace raw token-score averaging with aggregation of calibrated log-likelihood ratios under Bayesian decision theory.
Load-bearing premise
Lightweight predictors of score distributions conditioned on position in hidden space can be learned reliably and generalize across datasets without introducing new overfitting or selection artifacts.
What would settle it
Applying the local calibration procedure to a fresh dataset or detector yields no AUROC gain, or direct measurement shows that token score distributions are statistically uniform across hidden-space positions.
Figures
read the original abstract
The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that token-level log-likelihood signals distinguishing human from machine-generated text are non-uniform across regions of the detector model's hidden space, and that naive averaging (as in most existing detectors) induces a form of Simpson's paradox that destroys the local signal. It proposes a modular remedy: learn lightweight predictors of per-region score distributions, then aggregate calibrated log-likelihood ratios derived from a Bayesian decision-theoretic formulation. This intervention is reported to yield large, consistent AUROC gains (e.g., Fast-DetectGPT from 0.63 to 0.85 on GPT-5.4 text) and to enable a new DMAP detector that reaches state-of-the-art performance; the central contribution is framed as the diagnosis plus the generalizable calibration step rather than any single new detector.
Significance. If the reported gains prove robust, the work is significant because it supplies both a precise mechanistic diagnosis of a widespread failure mode in likelihood-based detectors and a lightweight, modular, Bayesian-grounded fix that can be dropped into any token-averaging pipeline. The emphasis on distributional heterogeneity and the explicit grounding in decision theory are strengths that could seed follow-on work on richer distributional models and geometry-aware ensembling. The absence of error bars, sample sizes, and validation details for the learned predictors, however, currently limits the strength of the empirical claims.
major comments (3)
- [Abstract] Abstract: the concrete AUROC improvements (0.63 to 0.85 for Fast-DetectGPT on GPT-5.4 text, plus SOTA claims for DMAP) are presented without error bars, dataset sizes, number of evaluation samples, or any statistical significance tests. Because the central empirical claim rests on these numbers, their robustness cannot be assessed from the given information.
- [Method / local calibration step] The local calibration procedure (lightweight predictors of score distributions conditioned on hidden-space position): the manuscript does not specify how these predictors are trained, regularized, or validated (e.g., whether they are fit on the same data used for AUROC evaluation, whether hold-out or cross-validation is used, or how out-of-distribution stability is tested). This detail is load-bearing for the claim that the gains arise from correcting Simpson's paradox rather than from implicit leakage or overfitting to dataset-specific artifacts.
- [Bayesian decision theory formulation] The Bayesian grounding is invoked to justify calibrated log-likelihood ratios, yet the actual implementation still depends on fitted local predictors whose parameters are learned from data. It is therefore unclear whether the method remains consistent with the stated parameter-free ideal or whether the reported gains are driven by the learned components.
minor comments (2)
- [Section 3] Notation for the hidden-space regions and the conditioning variables should be introduced once and used consistently; several passages use informal descriptions that could be replaced by a single formal definition.
- [Experiments] The manuscript would benefit from an explicit ablation table isolating the contribution of the calibration step versus other implementation choices (e.g., choice of base detector, token selection).
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help improve the clarity and robustness of our work. We address each major comment below and plan to revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the concrete AUROC improvements (0.63 to 0.85 for Fast-DetectGPT on GPT-5.4 text, plus SOTA claims for DMAP) are presented without error bars, dataset sizes, number of evaluation samples, or any statistical significance tests. Because the central empirical claim rests on these numbers, their robustness cannot be assessed from the given information.
Authors: We agree that including error bars, dataset sizes, number of evaluation samples, and statistical significance would strengthen the abstract. In the revision, we will update the abstract to report these details, such as the sample sizes from our evaluation datasets and any p-values or confidence intervals for the AUROC gains. revision: yes
-
Referee: [Method / local calibration step] The local calibration procedure (lightweight predictors of score distributions conditioned on hidden-space position): the manuscript does not specify how these predictors are trained, regularized, or validated (e.g., whether they are fit on the same data used for AUROC evaluation, whether hold-out or cross-validation is used, or how out-of-distribution stability is tested). This detail is load-bearing for the claim that the gains arise from correcting Simpson's paradox rather than from implicit leakage or overfitting to dataset-specific artifacts.
Authors: We acknowledge the need for more detail here. The predictors are trained on a disjoint held-out portion of the data, using cross-validation for model selection and regularization techniques like L2 penalties to prevent overfitting. Out-of-distribution stability is assessed by testing on datasets from different domains. We will expand the methods section with a full description of the training, validation, and testing procedures to demonstrate that the gains are due to the calibration correcting the non-uniform signals rather than data leakage. revision: yes
-
Referee: [Bayesian decision theory formulation] The Bayesian grounding is invoked to justify calibrated log-likelihood ratios, yet the actual implementation still depends on fitted local predictors whose parameters are learned from data. It is therefore unclear whether the method remains consistent with the stated parameter-free ideal or whether the reported gains are driven by the learned components.
Authors: The Bayesian decision-theoretic formulation derives the form of the optimal detector as the log-likelihood ratio under the local distributions, which is parameter-free when distributions are known. Since they are not, we learn lightweight predictors to estimate them, which is a practical realization of the Bayesian approach. This does not contradict the ideal; rather, it implements it. The gains arise from accounting for the heterogeneity in score distributions across hidden space, as diagnosed by the Simpson's paradox analysis. We will add text clarifying this relationship in the revised manuscript. revision: partial
Circularity Check
No significant circularity; empirical calibration method is self-contained
full rationale
The paper diagnoses non-uniform token-score statistics across hidden-space regions as causing Simpson's paradox under naive averaging, then proposes learning lightweight predictors of per-region score distributions to produce calibrated log-likelihood ratios, grounded in Bayesian decision theory. This is an explicit data-driven modeling step rather than a derivation that reduces to its own inputs by construction. No equations or claims equate the final detector performance to a tautological fit or self-citation chain; the reported AUROC gains (e.g., Fast-DetectGPT from 0.63 to 0.85) are presented as empirical outcomes on held-out datasets. The method is modular and compatible with existing pipelines, with no load-bearing self-citations or uniqueness theorems invoked. The central contribution remains an independent diagnosis plus a corrective intervention whose validity rests on external benchmarks rather than internal redefinition.
Axiom & Free-Parameter Ledger
free parameters (1)
- lightweight predictor parameters
axioms (2)
- domain assumption Token-level signal distinguishing human and machine text is non-uniform across hidden space
- domain assumption Lightweight predictors can accurately model conditional score distributions
Reference graph
Works this paper leans on
-
[1]
GLTR: Statistical Detection and Visualization of Generated Text
Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. GLTR: Statistical Detection and Visualization of Generated Text. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116, 2019
work page 2019
-
[2]
DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. In International Conference on Machine Learning, pages 24950–24962. PMLR, 2023
work page 2023
-
[3]
Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. InThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[4]
Spotting LLMs With Binoculars: Zero- Shot Detection of Machine-Generated Text
Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting LLMs With Binoculars: Zero- Shot Detection of Machine-Generated Text. InProceedings of the 41st International Conference on Machine Learning, pages 17519–17537, 2024
work page 2024
-
[5]
DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text
Jinyan Su, Terry Zhuo, Di Wang, and Preslav Nakov. DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 12395–12412, 2023
work page 2023
-
[6]
Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, Zhiqiang Xu, Yao Li, Haifeng Chen, Wei Cheng, and Dongkuan Xu. DALD: Improving Logits-based Detector without Logits from Black-box LLMs.Advances in Neural Information Processing Systems, 37:54947–54973, 2024
work page 2024
-
[7]
DMAP: A Distribution Map for Text
Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wre- silo, Yoann Launay, David Sutton, and Stuart Burrell. DMAP: A Distribution Map for Text. In The Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[8]
TempTest: Local Normalization Distortion and the Detection of Machine-generated Text
Tom Kempton, Stuart Burrell, and Connor J Cheverall. TempTest: Local Normalization Distortion and the Detection of Machine-generated Text. InInternational Conference on Artificial Intelligence and Statistics, pages 1972–1980. PMLR, 2025
work page 1972
-
[9]
OpenAI. GPT-5 system card. Technical report, OpenAI, August 2025. arXiv:2601.03267
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Google DeepMind. Gemini 3.1 Pro Model Card. Technical report, Google DeepMind, February 2026
work page 2026
-
[11]
Anthropic. Claude Sonnet 4.6 System Card. Technical report, Anthropic, February 2026
work page 2026
-
[12]
Calibrating Language Models with Adaptive Temperature Scaling
Johnathan Xie, Annie S Chen, Yoonho Lee, Eric Mitchell, and Chelsea Finn. Calibrating Language Models with Adaptive Temperature Scaling. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18128–18138, 2024
work page 2024
-
[13]
Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S Chao, and Derek F Wong. RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns.Transactions of the Association for Computational Linguistics, 13:1812–1831, 2025
work page 2025
-
[14]
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
Liam Dugan, Alyssa Hwang, Filip Trhlík, Andrew Zhu, Josh Magnus Ludan, Hainiu Xu, Daphne Ippolito, and Chris Callison-Burch. RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12463–12492, 2024
work page 2024
-
[15]
Ke Sun, Guangsheng Bao, Han Cui, and Yue Zhang. When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection.arXiv preprint arXiv:2601.04833, 2026. 10
-
[16]
SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis
Haitong Luo, Weiyao Zhang, Suhang Wang, Wenji Zou, Chungang Lin, Xuying Meng, and Yujun Zhang. SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32356–32364, 2026
work page 2026
-
[17]
Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Sergey Nikolenko, Evgeny Burnaev, Serguei Barannikov, and Irina Piontkovskaya. Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts.Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[18]
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
Sungduk Yu, Man Luo, Avinash Madasu, Vasudev Lal, and Phillip Howard. Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review. InThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[19]
Visualizing data using t-sne.Journal of Machine Learning Research, 9(86):2579–2605, 2008
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9(86):2579–2605, 2008. A Why one should expect naive token-score aggregation to underperform In the introduction we made three claims motivating Simpson’s paradox based on previous results in the literature. In this section we expand on those cla...
work page 2008
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.