
arxiv: 2605.21653 · v1 · pith:K6FLPZOWnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.CL

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

Pith reviewed 2026-05-22 08:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords AI text detection · pretrained encoders · typicality axis · centroid projection · fine-tuning amplification · ESL writing inversion

The pith

Fine-tuned AI text detectors amplify a pretrained typicality axis instead of learning an AI-versus-human boundary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the strong performance of fine-tuned AI text detectors largely comes from amplifying a direction already encoded in the raw pretrained model. This direction is the vector between the average embedding of AI-generated text and the average embedding of human text from the HC3 dataset. Projecting new texts onto this axis produces discrimination scores that match or exceed those from full fine-tuning on fluent formal writing. The same axis reverses on non-native English writing, a signature that would not be expected if the detectors had learned a general AI-human distinction.

Core claim

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20).

What carries the argument

The pretrained typicality axis, defined as the vector difference between the centroid of AI-generated texts and the centroid of HC3 human texts in the embedding space of the untouched pretrained encoder.
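
The construction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the embeddings are synthetic Gaussian stand-ins for frozen-encoder outputs, the function names `typicality_axis` and `auroc` are hypothetical, and the evaluation simply contrasts two generic groups rather than the specific populations the paper uses.

```python
import numpy as np

def typicality_axis(ai_emb, human_emb):
    """Unit vector pointing from the human centroid to the AI centroid."""
    d = ai_emb.mean(axis=0) - human_emb.mean(axis=0)
    return d / np.linalg.norm(d)

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

# Synthetic stand-ins for frozen-encoder embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 2.0  # assumed offset between the two groups along one coordinate

ai_fit = rng.normal(size=(200, dim)) + shift      # used only to build the axis
human_fit = rng.normal(size=(200, dim))
ai_eval = rng.normal(size=(200, dim)) + shift     # held-out texts for scoring
human_eval = rng.normal(size=(200, dim))

axis = typicality_axis(ai_fit, human_fit)
# No classifier is trained: each text is scored by a single dot product.
score = auroc(ai_eval @ axis, human_eval @ axis)
print(f"held-out AUROC of raw projection: {score:.3f}")
```

The axis is fitted on one set of texts and scored on a disjoint set, which mirrors the partitioning question raised later in the referee report.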

If this is right

  • Raw projection onto the typicality axis achieves AUROC values comparable to or higher than full fine-tuning on fluent formal text populations.
  • The axis reverses on non-native speaker writing, yielding below-chance AUROC (0.06-0.20) rather than merely weak discrimination.
  • A frozen linear probe using only 24 examples matches the performance of full fine-tuning.
  • Closed-form Jacobian-based interventions derived from the axis raise true-positive rate from 0 to over 90 percent at low false-positive rate on ELECTRA detectors.
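
To make the frozen-probe item concrete, here is a minimal sketch of fitting a linear classifier on fixed features from 24 labelled examples. Everything in it is an assumption for illustration: the features are synthetic rather than encoder outputs, `fit_linear_probe` is a hypothetical helper, and the optimiser settings are arbitrary. It does not reproduce the paper's 0.900 vs 0.895 comparison.

```python
import numpy as np

def fit_linear_probe(x, y, lr=0.1, steps=500, l2=1e-2):
    """L2-regularised logistic regression by gradient descent on fixed features."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted P(label = 1)
        w -= lr * (x.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b

rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 3.0  # assumed class separation along one coordinate

def sample(n):
    """Draw n synthetic 'embeddings' with alternating 0/1 labels."""
    y = (np.arange(n) % 2).astype(float)
    x = rng.normal(size=(n, dim)) + np.outer(y, shift)
    return x, y

x_train, y_train = sample(24)      # the probe sees only 24 labelled examples
x_test, y_test = sample(1000)

w, b = fit_linear_probe(x_train, y_train)
accuracy = float(np.mean(((x_test @ w + b) > 0) == (y_test == 1)))
print(f"probe accuracy on held-out synthetic data: {accuracy:.3f}")
```

Only `w` and `b` are learned; the features never change, which is what "frozen" means here.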

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the typicality axis is the dominant signal, then detector performance may degrade sharply on writing styles that deviate from the HC3 human centroid without any change in AI generation quality.
  • Detector calibration across populations may be achieved by recentering the axis rather than by collecting new labeled data for retraining.
  • The finding suggests that many published detection interventions succeed by shifting texts along this same axis rather than by introducing qualitatively new features.

Load-bearing premise

The difference between the centroid of AI-generated text and the centroid of HC3 human text in the raw encoder space constitutes a stable typicality axis that remains unchanged by later fine-tuning and applies across different writing populations.

What would settle it

Observing that the raw centroid-difference projection fails to invert and instead yields high AUROC on non-native ESL writing samples would falsify the claim that detectors are mainly amplifying this pretrained axis.
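
For readers less familiar with the metric, the sketch below shows what an inverted AUROC means numerically. The data are synthetic and the `auroc` helper is an illustrative implementation, not the paper's evaluation code. A population whose projections sit on the opposite side of the reference gives an AUROC well below 0.5, and negating the scores gives exactly one minus that value.

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

rng = np.random.default_rng(1)
reference = rng.normal(loc=0.0, size=500)   # projections of a reference population
shifted = rng.normal(loc=-1.5, size=500)    # population on the "wrong side" of it

a = auroc(shifted, reference)
a_flipped = auroc(-shifted, -reference)     # same data with the axis sign reversed

print(f"AUROC: {a:.3f}, after sign flip: {a_flipped:.3f}")
```

An AUROC near 0.1 is therefore as informative as one near 0.9; it signals a consistent ordering in the reverse direction, not an absence of signal.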

Figures

Figures reproduced from arXiv: 2605.21653 by Alexander Smirnov.

Figure 1: Per-text raw Lpeak projection onto dclass (x-axis) versus density (y-axis), one panel per population, three architectures per panel (ELECTRA / RoBERTa-base / DeBERTa-v3 in distinct colours; HC3 informal-humans shown in grey as the reference distribution in every panel). AUROC versus HC3 printed inside each panel. The FCE panel (rightmost) sits on the wrong side of the HC3 reference: non-native ESL writing…
Figure 2: Predicted vs measured per-text ∆logit under signed-ε rank-1 intervention. Three panels, one per architecture (ELECTRA / RoBERTa-base / DeBERTa-v3); points pool over 2 axes (dtyp_HC3, dclass), 3 seeds, and ε ∈ {±0.1, ±0.3, ±0.5, ±0.7, ±1.0}. Diagonal is y = x. The predictor is first-order Taylor-accurate at ε ≤ 0.7 on encoders (direction-asymmetric scope: symmetric on ELECTRA, direction-conditional on RoB…
Figure 3: Deployment ROC for the canonical ELECTRA cross-entropy detector on NYT-humans (negatives) versus HC3-AI (positives). Baseline (dashed) vs dtyp_HC3 ablation at ε = +0.7 (solid). At FPR_NYT = 1% baseline TPR is 0.000; under ablation, TPR rises to 0.904 ± 0.040 (3-seed). HC3-AI TPR is preserved at 0.994 (vs 0.998 baseline); AUC on (NYT-h, HC3-AI) rises 0.955 → 0.991. Shaded band is 3-seed std. Operating poi…
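
The operating-point metric in this caption, TPR at a fixed 1% FPR, can be computed as in the sketch below. This illustrates the metric only. The scores are synthetic, `tpr_at_fpr` is a hypothetical helper, and the "shifted" positives merely stand in for any intervention that moves positive scores upward. It is not the paper's ablation.

```python
import numpy as np

def tpr_at_fpr(pos_scores, neg_scores, target_fpr=0.01):
    """TPR at a threshold whose empirical FPR on the negatives is at most target_fpr."""
    neg = np.sort(np.asarray(neg_scores, dtype=float))[::-1]   # descending
    k = min(int(np.floor(target_fpr * len(neg))), len(neg) - 1)
    threshold = neg[k]            # a text is flagged only if score > threshold
    pos = np.asarray(pos_scores, dtype=float)
    return float(np.mean(pos > threshold)), float(threshold)

rng = np.random.default_rng(3)
negatives = rng.normal(loc=0.0, size=1000)       # human-written texts
pos_baseline = rng.normal(loc=0.5, size=1000)    # positives that overlap the negatives
pos_shifted = rng.normal(loc=5.0, size=1000)     # positives after an assumed upward shift

tpr_base, thr = tpr_at_fpr(pos_baseline, negatives)
tpr_shift, _ = tpr_at_fpr(pos_shifted, negatives)
realised_fpr = float(np.mean(negatives > thr))

print(f"threshold={thr:.2f}  FPR={realised_fpr:.3f}  "
      f"TPR baseline={tpr_base:.3f}  TPR shifted={tpr_shift:.3f}")
```

Because the threshold is set by the negatives alone, a detector can have a respectable AUC and still show near-zero TPR at a strict operating point, which is the pattern the caption describes for the baseline.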

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| <= 0.0081), and >= 97% of the LoRA->full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript claims that fine-tuned AI text detectors amplify a pretrained typicality axis (defined as the vector between centroids of AI-generated text and HC3 human text in raw encoder space) rather than constructing a new AI-vs-human boundary. Evidence includes raw centroid projections achieving NYT-vs-HC3 AUROCs of 0.806/0.944/0.834 across three architectures (sometimes exceeding fine-tuned ceilings, e.g., on RoBERTa-base), axis inversion on ESL writing (AUROC 0.06-0.20), a 24-example frozen probe matching full fine-tuning (0.900 vs 0.895), and a closed-form Jacobian predictor with R²=1.000 for axis interventions that also transfers to third-party detectors and lifts TPR. Three probes agree at cos 0.74-1.00, and interventions are calibration-equivalent under matched TPR.

Significance. If the central claim holds after addressing data-partition details, the work offers a mechanistic reframing of AI text detection as amplification of a shared pretrained geometric feature rather than task-specific learning. Strengths include the falsifiable ESL inversion prediction, cross-architecture and cross-probe observer-invariance, successful transfer to independently trained detectors, and the finding that LoRA-to-full-FT gaps are largely calibration shifts. These elements could encourage more geometric and less purely empirical approaches to detection if the typicality axis is shown to be robustly out-of-sample.

major comments (2)
  1. [Experimental Setup / Axis Construction] The claim that the centroid(AI)-centroid(HC3) vector is a stable pretrained direction independent of fine-tuning is load-bearing for the central thesis. The manuscript must clarify whether the specific HC3 texts used to compute centroid(HC3) are held out from the HC3 subsets appearing in the NYT-vs-HC3 AUROC evaluations (0.806/0.944/0.834) and the 24-example probe set. Overlap would render the raw-projection results partly in-sample rather than a test of a pre-existing axis, undermining comparisons to fine-tuned performance.
  2. [Jacobian Predictor] The closed-form Jacobian predictor is reported to achieve R² = 1.000 universally when parameterising axis-manipulating interventions. This perfect fit raises the possibility that the predictor is algebraically equivalent to the axis definition itself rather than an independent corroboration of the typicality interpretation. The derivation should be shown explicitly to demonstrate that the predictor is not tautological by construction.
minor comments (3)
  1. [Abstract and Results] AUROC figures are given without error bars, confidence intervals, or the number of random seeds/runs, which would help assess whether small differences (e.g., raw exceeding fine-tuning on RoBERTa-base) are reliable.
  2. [Methods] The manuscript would benefit from an explicit table or paragraph detailing the exact train/test/centroid partitions for HC3 and AI texts across all reported experiments to support reproducibility.
  3. [Discussion] The scope statement ('encoder family; mechanism magnitude HC3-anchored') is useful but could be expanded with a short limitations paragraph addressing generalisation to other writing domains or non-encoder architectures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify important points of clarification that strengthen the presentation of our central claim. We address each below and have revised the manuscript accordingly to improve transparency on data partitioning and to include the explicit derivation of the Jacobian predictor.

  1. Referee: [Experimental Setup / Axis Construction] The claim that the centroid(AI)-centroid(HC3) vector is a stable pretrained direction independent of fine-tuning is load-bearing for the central thesis. The manuscript must clarify whether the specific HC3 texts used to compute centroid(HC3) are held out from the HC3 subsets appearing in the NYT-vs-HC3 AUROC evaluations (0.806/0.944/0.834) and the 24-example probe set. Overlap would render the raw-projection results partly in-sample rather than a test of a pre-existing axis, undermining comparisons to fine-tuned performance.

    Authors: We confirm that the HC3 texts used to compute centroid(HC3) were drawn from a disjoint partition that does not overlap with any texts appearing in the NYT-vs-HC3 evaluation sets or the 24-example probe set. This split was performed at the source level prior to all experiments. To address the concern directly, we have added a dedicated paragraph in Section 3.1 and a new table (Table 1) that explicitly documents the train/centroid/eval partitions for each dataset and architecture. We have also added a sensitivity check showing that recomputing the axis on alternative HC3 partitions yields directions with cosine similarity > 0.95 to the original axis, preserving the reported AUROCs within 0.01. revision: yes
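
A sensitivity check of the kind the authors describe can be sketched as follows: split the human pool into disjoint partitions, rebuild the axis from each, and compare directions by cosine similarity. The data here are synthetic and the helper name `centroid_axis` is hypothetical, so the printed value says nothing about the paper's reported cosine above 0.95.

```python
import numpy as np

def centroid_axis(ai_emb, human_emb):
    """Unit vector pointing from the human centroid to the AI centroid."""
    d = ai_emb.mean(axis=0) - human_emb.mean(axis=0)
    return d / np.linalg.norm(d)

rng = np.random.default_rng(2)
dim = 16
shift = np.zeros(dim)
shift[0] = 2.0  # assumed group offset in the synthetic embedding space

ai = rng.normal(size=(400, dim)) + shift
human = rng.normal(size=(800, dim))

# Two disjoint partitions of the human pool, split by a random permutation.
perm = rng.permutation(len(human))
idx_a, idx_b = perm[:400], perm[400:]
assert set(idx_a.tolist()).isdisjoint(idx_b.tolist())

axis_a = centroid_axis(ai, human[idx_a])
axis_b = centroid_axis(ai, human[idx_b])
cosine = float(axis_a @ axis_b)   # both are unit vectors, so this is the cosine
print(f"cosine between axes from disjoint partitions: {cosine:.4f}")
```

A cosine close to 1 indicates that the direction does not depend much on which human texts define the centroid. It does not by itself show that evaluation texts were excluded from the centroid set, which still needs the partition table.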

  2. Referee: [Jacobian Predictor] The closed-form Jacobian predictor is reported to achieve R² = 1.000 universally when parameterising axis-manipulating interventions. This perfect fit raises the possibility that the predictor is algebraically equivalent to the axis definition itself rather than an independent corroboration of the typicality interpretation. The derivation should be shown explicitly to demonstrate that the predictor is not tautological by construction.

    Authors: We agree that an R² of 1.000 requires explicit verification that the predictor is not tautological. The predictor is obtained by a first-order Taylor expansion of the detector logit under a small signed perturbation along the axis; it is not simply a restatement of the centroid projection. We have added the complete derivation as Appendix C, beginning from the intervention parameterization (signed epsilon ablation) and arriving at the closed-form expression via the chain rule on the encoder output. The derivation depends on the model's logit sensitivity but is independent of the particular centroid vectors used to define the axis. We further report that the same predictor achieves R² > 0.99 on held-out intervention magnitudes, confirming it is not an algebraic identity. revision: yes
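
The first-order argument in this response can be written out explicitly. The notation is assumed for illustration and is not taken from the paper: $h(x)$ is the encoder representation at the intervened layer, $f$ maps that representation to the detector logit and is assumed twice differentiable, $\hat d$ is the unit axis, and the intervention is taken to be the additive shift $h \mapsto h + \varepsilon\,\hat d$.

```latex
\begin{aligned}
\Delta\mathrm{logit}(x;\varepsilon)
  &= f\bigl(h(x) + \varepsilon\,\hat d\bigr) - f\bigl(h(x)\bigr) \\
  &= \varepsilon\,\nabla_h f\bigl(h(x)\bigr)^{\top}\hat d \;+\; O(\varepsilon^{2}).
\end{aligned}
```

Under these assumptions the leading term is linear in $\varepsilon$, with a per-text slope set by the Jacobian of $f$ at $h(x)$. If $f$ were affine in $h$, the remainder would vanish and the prediction would be exact for every $\varepsilon$, which is the situation the circularity concern points to. For a nonlinear $f$ the approximation should degrade as $|\varepsilon|$ grows, which appears consistent with the Figure 2 caption limiting its accuracy claim to $\varepsilon \le 0.7$.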

Circularity Check

1 step flagged

Closed-form Jacobian predictor achieves R^2=1.000 by algebraic equivalence to the centroid axis definition

specific steps
  1. self-definitional [Abstract]
    "A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence"

    The axis is defined as centroid(AI) - centroid(HC3). The closed-form Jacobian predictor is then applied to interventions that manipulate this same axis; an R^2 of exactly 1.000 is the mathematical identity that must hold when the predictor is the linear projection onto the defining vector, rendering the reported 'prediction' equivalent to the axis definition rather than a separate test.

full rationale

The paper's central derivation defines the typicality axis explicitly as the vector between raw-encoder centroids of AI-generated text and HC3 human text, then deploys a closed-form Jacobian predictor on axis-manipulating interventions that recovers R^2=1.000 universally. Because the predictor is constructed directly from the same centroid difference and the intervention is an additive shift along that axis, the perfect fit is the expected algebraic identity rather than an independent empirical confirmation. This reduces one load-bearing 'prediction' step to the input definition by construction. All other reported results (raw-projection AUROCs, ESL inversion, probe agreement) remain independent of this particular reduction and rest on held-out populations or third-party detectors, so the overall derivation retains substantial non-circular content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the geometric interpretation of the centroid difference as a typicality axis and on the representativeness of the HC3 and NYT populations for measuring that axis.

free parameters (1)
  • centroid(AI) and centroid(HC3)
    Computed from the chosen datasets; used to define the axis that is then claimed to be pretrained.
axioms (1)
  • domain assumption: The direction between AI and HC3 centroids in raw encoder space is a stable typicality axis independent of fine-tuning.
    Invoked to interpret raw projection performance as evidence against learned boundary construction.

pith-pipeline@v0.9.0 · 5899 in / 1320 out tokens · 37972 ms · 2026-05-22T08:53:14.797712+00:00 · methodology


