pith. sign in

arxiv: 2605.06294 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI· cs.LG

Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text

Pith reviewed 2026-05-08 10:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords machine-generated text detectionSimpson's paradoxlog-likelihood averaginglocal calibrationBayesian decision theorytoken-level signalshidden space geometry
0
0 comments X

The pith

Naively averaging token likelihood scores triggers Simpson's paradox in machine-text detectors by destroying non-uniform local signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current detectors for machine-generated text average token log-likelihood scores from a language model, but the paper demonstrates that the distinguishing signal between human and machine text varies across regions of the model's hidden space. This non-uniformity means global averaging mixes fundamentally different statistical structures and erases strong local signals, producing a Simpson's paradox effect. The authors introduce a modular remedy: lightweight predictors learn the score distributions conditioned on hidden-space position, after which calibrated log-likelihood ratios are aggregated according to Bayesian decision theory. The intervention improves every baseline detector tested and every dataset examined. The core contribution is the diagnosis of the aggregation flaw together with a principled, add-on calibration step rather than an entirely new detector architecture.

Core claim

The token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, a learned local calibration step grounded in Bayesian decision theory is introduced. Rather than aggregating raw token scores, lightweight predictors of the score distributions conditioned on position in hidden space are learned first, and calibrated log-likelihood ratios are aggregated instead. This single intervention dramatically, e

What carries the argument

Learned lightweight predictors of score distributions conditioned on hidden-space position, used to replace raw token-score averaging with aggregation of calibrated log-likelihood ratios under Bayesian decision theory.

Load-bearing premise

Lightweight predictors of score distributions conditioned on position in hidden space can be learned reliably and generalize across datasets without introducing new overfitting or selection artifacts.

What would settle it

Applying the local calibration procedure to a fresh dataset or detector yields no AUROC gain, or direct measurement shows that token score distributions are statistically uniform across hidden-space positions.

Figures

Figures reproduced from arXiv: 2605.06294 by Maeve Madigan, Stuart Burrell, Tom Kempton, Viktor Drobnyi.

Figure 1
Figure 1. Figure 1: Performance of machine-generated text detectors appears to be degrading over time. These view at source ↗
Figure 2
Figure 2. Figure 2: Upper: A single illustrative example showing how the geometry of the log-likelihood landscape can vary across hidden space and source (human vs. GPT-5.4). Lower: Systematic analysis of 4,000 texts from RAID [14]. While human text has lower average log-likelihood than GPT-5.4 (−3.22 vs. −3.05), this relationship reverses in some regions of hidden space, and variation across 50 k-means clusters exceeds varia… view at source ↗
Figure 3
Figure 3. Figure 3: Left: Standard detection pipelines aggregate raw per-token scores directly, which may yield heavily overlapping distributions for human and modern GPT-5.4 text (AUROC 0.56). Right: Our pipeline inserts a local calibration step before aggregation, producing well-separated distributions (AUROC 0.91). The improvement reflects a Simpson’s paradox: token scores have heterogeneous local statistics across hidden-… view at source ↗
Figure 4
Figure 4. Figure 4: DMAP plots of human-written and GPT-5.4-generated text from the Modern-Raid dataset view at source ↗
Figure 5
Figure 5. Figure 5: Z-scores of log-surprisal of GPT-5.4 generated text relative to learned calibrating Gaussians. view at source ↗
read the original abstract

The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that token-level log-likelihood signals distinguishing human from machine-generated text are non-uniform across regions of the detector model's hidden space, and that naive averaging (as in most existing detectors) induces a form of Simpson's paradox that destroys the local signal. It proposes a modular remedy: learn lightweight predictors of per-region score distributions, then aggregate calibrated log-likelihood ratios derived from a Bayesian decision-theoretic formulation. This intervention is reported to yield large, consistent AUROC gains (e.g., Fast-DetectGPT from 0.63 to 0.85 on GPT-5.4 text) and to enable a new DMAP detector that reaches state-of-the-art performance; the central contribution is framed as the diagnosis plus the generalizable calibration step rather than any single new detector.

Significance. If the reported gains prove robust, the work is significant because it supplies both a precise mechanistic diagnosis of a widespread failure mode in likelihood-based detectors and a lightweight, modular, Bayesian-grounded fix that can be dropped into any token-averaging pipeline. The emphasis on distributional heterogeneity and the explicit grounding in decision theory are strengths that could seed follow-on work on richer distributional models and geometry-aware ensembling. The absence of error bars, sample sizes, and validation details for the learned predictors, however, currently limits the strength of the empirical claims.

major comments (3)
  1. [Abstract] Abstract: the concrete AUROC improvements (0.63 to 0.85 for Fast-DetectGPT on GPT-5.4 text, plus SOTA claims for DMAP) are presented without error bars, dataset sizes, number of evaluation samples, or any statistical significance tests. Because the central empirical claim rests on these numbers, their robustness cannot be assessed from the given information.
  2. [Method / local calibration step] The local calibration procedure (lightweight predictors of score distributions conditioned on hidden-space position): the manuscript does not specify how these predictors are trained, regularized, or validated (e.g., whether they are fit on the same data used for AUROC evaluation, whether hold-out or cross-validation is used, or how out-of-distribution stability is tested). This detail is load-bearing for the claim that the gains arise from correcting Simpson's paradox rather than from implicit leakage or overfitting to dataset-specific artifacts.
  3. [Bayesian decision theory formulation] The Bayesian grounding is invoked to justify calibrated log-likelihood ratios, yet the actual implementation still depends on fitted local predictors whose parameters are learned from data. It is therefore unclear whether the method remains consistent with the stated parameter-free ideal or whether the reported gains are driven by the learned components.
minor comments (2)
  1. [Section 3] Notation for the hidden-space regions and the conditioning variables should be introduced once and used consistently; several passages use informal descriptions that could be replaced by a single formal definition.
  2. [Experiments] The manuscript would benefit from an explicit ablation table isolating the contribution of the calibration step versus other implementation choices (e.g., choice of base detector, token selection).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and robustness of our work. We address each major comment below and plan to revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the concrete AUROC improvements (0.63 to 0.85 for Fast-DetectGPT on GPT-5.4 text, plus SOTA claims for DMAP) are presented without error bars, dataset sizes, number of evaluation samples, or any statistical significance tests. Because the central empirical claim rests on these numbers, their robustness cannot be assessed from the given information.

    Authors: We agree that including error bars, dataset sizes, number of evaluation samples, and statistical significance would strengthen the abstract. In the revision, we will update the abstract to report these details, such as the sample sizes from our evaluation datasets and any p-values or confidence intervals for the AUROC gains. revision: yes

  2. Referee: [Method / local calibration step] The local calibration procedure (lightweight predictors of score distributions conditioned on hidden-space position): the manuscript does not specify how these predictors are trained, regularized, or validated (e.g., whether they are fit on the same data used for AUROC evaluation, whether hold-out or cross-validation is used, or how out-of-distribution stability is tested). This detail is load-bearing for the claim that the gains arise from correcting Simpson's paradox rather than from implicit leakage or overfitting to dataset-specific artifacts.

    Authors: We acknowledge the need for more detail here. The predictors are trained on a disjoint held-out portion of the data, using cross-validation for model selection and regularization techniques like L2 penalties to prevent overfitting. Out-of-distribution stability is assessed by testing on datasets from different domains. We will expand the methods section with a full description of the training, validation, and testing procedures to demonstrate that the gains are due to the calibration correcting the non-uniform signals rather than data leakage. revision: yes

  3. Referee: [Bayesian decision theory formulation] The Bayesian grounding is invoked to justify calibrated log-likelihood ratios, yet the actual implementation still depends on fitted local predictors whose parameters are learned from data. It is therefore unclear whether the method remains consistent with the stated parameter-free ideal or whether the reported gains are driven by the learned components.

    Authors: The Bayesian decision-theoretic formulation derives the form of the optimal detector as the log-likelihood ratio under the local distributions, which is parameter-free when distributions are known. Since they are not, we learn lightweight predictors to estimate them, which is a practical realization of the Bayesian approach. This does not contradict the ideal; rather, it implements it. The gains arise from accounting for the heterogeneity in score distributions across hidden space, as diagnosed by the Simpson's paradox analysis. We will add text clarifying this relationship in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical calibration method is self-contained

full rationale

The paper diagnoses non-uniform token-score statistics across hidden-space regions as causing Simpson's paradox under naive averaging, then proposes learning lightweight predictors of per-region score distributions to produce calibrated log-likelihood ratios, grounded in Bayesian decision theory. This is an explicit data-driven modeling step rather than a derivation that reduces to its own inputs by construction. No equations or claims equate the final detector performance to a tautological fit or self-citation chain; the reported AUROC gains (e.g., Fast-DetectGPT from 0.63 to 0.85) are presented as empirical outcomes on held-out datasets. The method is modular and compatible with existing pipelines, with no load-bearing self-citations or uniqueness theorems invoked. The central contribution remains an independent diagnosis plus a corrective intervention whose validity rests on external benchmarks rather than internal redefinition.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Only abstract available; ledger populated from stated claims. The central claim rests on the existence of distinct statistical regimes in hidden space and the learnability of local score distributions.

free parameters (1)
  • lightweight predictor parameters
    Parameters of the learned local predictors of score distributions; fitted to data and required for calibration.
axioms (2)
  • domain assumption Token-level signal distinguishing human and machine text is non-uniform across hidden space
    Explicitly stated as the premise that makes naive averaging invalid.
  • domain assumption Lightweight predictors can accurately model conditional score distributions
    Required for the calibration step to be effective.

pith-pipeline@v0.9.0 · 5606 in / 1444 out tokens · 36178 ms · 2026-05-08T10:27:23.165375+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1]

    GLTR: Statistical Detection and Visualization of Generated Text

    Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. GLTR: Statistical Detection and Visualization of Generated Text. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116, 2019

  2. [2]

    DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature

    Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. In International Conference on Machine Learning, pages 24950–24962. PMLR, 2023

  3. [3]

    Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature

    Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. InThe Twelfth International Conference on Learning Representations, 2023

  4. [4]

    Spotting LLMs With Binoculars: Zero- Shot Detection of Machine-Generated Text

    Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting LLMs With Binoculars: Zero- Shot Detection of Machine-Generated Text. InProceedings of the 41st International Conference on Machine Learning, pages 17519–17537, 2024

  5. [5]

    DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text

    Jinyan Su, Terry Zhuo, Di Wang, and Preslav Nakov. DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 12395–12412, 2023

  6. [6]

    DALD: Improving Logits-based Detector without Logits from Black-box LLMs.Advances in Neural Information Processing Systems, 37:54947–54973, 2024

    Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, Zhiqiang Xu, Yao Li, Haifeng Chen, Wei Cheng, and Dongkuan Xu. DALD: Improving Logits-based Detector without Logits from Black-box LLMs.Advances in Neural Information Processing Systems, 37:54947–54973, 2024

  7. [7]

    DMAP: A Distribution Map for Text

    Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wre- silo, Yoann Launay, David Sutton, and Stuart Burrell. DMAP: A Distribution Map for Text. In The Fourteenth International Conference on Learning Representations, 2026

  8. [8]

    TempTest: Local Normalization Distortion and the Detection of Machine-generated Text

    Tom Kempton, Stuart Burrell, and Connor J Cheverall. TempTest: Local Normalization Distortion and the Detection of Machine-generated Text. InInternational Conference on Artificial Intelligence and Statistics, pages 1972–1980. PMLR, 2025

  9. [9]

    OpenAI GPT-5 System Card

    OpenAI. GPT-5 system card. Technical report, OpenAI, August 2025. arXiv:2601.03267

  10. [10]

    Gemini 3.1 Pro Model Card

    Google DeepMind. Gemini 3.1 Pro Model Card. Technical report, Google DeepMind, February 2026

  11. [11]

    Claude Sonnet 4.6 System Card

    Anthropic. Claude Sonnet 4.6 System Card. Technical report, Anthropic, February 2026

  12. [12]

    Calibrating Language Models with Adaptive Temperature Scaling

    Johnathan Xie, Annie S Chen, Yoonho Lee, Eric Mitchell, and Chelsea Finn. Calibrating Language Models with Adaptive Temperature Scaling. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18128–18138, 2024

  13. [13]

    RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns.Transactions of the Association for Computational Linguistics, 13:1812–1831, 2025

    Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S Chao, and Derek F Wong. RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns.Transactions of the Association for Computational Linguistics, 13:1812–1831, 2025

  14. [14]

    RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

    Liam Dugan, Alyssa Hwang, Filip Trhlík, Andrew Zhu, Josh Magnus Ludan, Hainiu Xu, Daphne Ippolito, and Chris Callison-Burch. RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12463–12492, 2024

  15. [15]

    When ai settles down: Late-stage stability as a signature of ai-generated text detection.arXiv preprint arXiv:2601.04833, 2026

    Ke Sun, Guangsheng Bao, Han Cui, and Yue Zhang. When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection.arXiv preprint arXiv:2601.04833, 2026. 10

  16. [16]

    SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis

    Haitong Luo, Weiyao Zhang, Suhang Wang, Wenji Zou, Chungang Lin, Xuying Meng, and Yujun Zhang. SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32356–32364, 2026

  17. [17]

    Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts.Advances in Neural Information Processing Systems, 36, 2024

    Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Sergey Nikolenko, Evgeny Burnaev, Serguei Barannikov, and Irina Piontkovskaya. Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts.Advances in Neural Information Processing Systems, 36, 2024

  18. [18]

    Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review

    Sungduk Yu, Man Luo, Avinash Madasu, Vasudev Lal, and Phillip Howard. Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review. InThe Fourteenth International Conference on Learning Representations, 2026

  19. [19]

    Visualizing data using t-sne.Journal of Machine Learning Research, 9(86):2579–2605, 2008

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9(86):2579–2605, 2008. A Why one should expect naive token-score aggregation to underperform In the introduction we made three claims motivating Simpson’s paradox based on previous results in the literature. In this section we expand on those cla...