Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
Pith reviewed 2026-05-08 19:26 UTC · model grok-4.3
The pith
A frontier LLM agent can lower its AI-detection probability with moderate effort against a known detector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After confirming the original study's correlation of r = +0.244 on 648 paired drafts, the agentic protocol shows GPT-5.5 and Claude Opus closing 71–75 percent of the style gap to the same-author ceiling on 324 held-out tasks and beating the human post-edit on about 80 percent of them. When the outputs are scored by a leave-authors-out linear SVM on LUAR-MUD embeddings, detection reaches AUC 0.93–1.00. Six diagnostics separate length confounds from genuine stylistic signals. Given twenty feedback iterations against the frozen detector, the Opus agent flips two of five held-out test mimics into the human half-space and shrinks every detection margin by an order of magnitude.
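A minimal sketch of what such a leave-authors-out detection protocol could look like, assuming precomputed LUAR-MUD embeddings; the function names and data layout below are illustrative, not the paper's released code.

```python
# Hypothetical sketch: leave-authors-out AUC for a linear SVM detector.
# Assumes X holds precomputed LUAR-MUD embeddings (one row per draft).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_authors_out_auc(X, y, authors):
    """X: (n, d) embeddings; y: 1 = AI mimic, 0 = human; authors: an
    author id per draft, so no author spans both train and test folds."""
    X, y, authors = np.asarray(X), np.asarray(y), np.asarray(authors)
    aucs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, authors):
        clf = LinearSVC(C=1.0).fit(X[train_idx], y[train_idx])
        # The signed distance to the hyperplane doubles as a detection score.
        scores = clf.decision_function(X[test_idx])
        if len(np.unique(y[test_idx])) == 2:  # AUC needs both classes present
            aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```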
What carries the argument
The iterative adversarial feedback loop that supplies the current detection score and margin from the fixed SVM to the LLM and requests revised mimic drafts for up to twenty rounds.
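A minimal sketch of that loop under stated assumptions: `score_margin` wraps the frozen SVM and returns only a scalar, and `llm_revise` stands in for a call to the generating model. Both names are hypothetical, not the released harness.

```python
# Hypothetical sketch of the twenty-round adversarial feedback loop.
# score_margin(text) -> float: signed margin from the frozen detector;
#                              negative means the human half-space.
# llm_revise(text, margin) -> str: the model's revision, given only the scalar.

def adversarial_loop(draft, score_margin, llm_revise, T=20):
    best_draft, best_margin = draft, score_margin(draft)
    for _ in range(T):
        if best_margin < 0:          # already classified as human-written
            break
        candidate = llm_revise(best_draft, best_margin)
        margin = score_margin(candidate)
        if margin < best_margin:     # keep only margin-reducing revisions
            best_draft, best_margin = candidate, margin
    return best_draft, best_margin
```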
If this is right
- All seven preregistered hypotheses from the source study are recovered under the agentic protocol.
- Frontier models already outperform human post-editors in matching personal style on the majority of tasks.
- Style detectors based on these embeddings are vulnerable to iterative self-correction by the generating model.
- Diagnostic checks can isolate length confounds from true stylistic detection signals.
- Full release of code, drafts, detectors, and trajectories enables direct community replication and extension.
Where Pith is reading between the lines
- Static embedding detectors may require continuous retraining on adversarial examples to remain effective.
- The same feedback loop could be applied to test robustness of other fixed-embedding classifiers in NLP.
- Rapid adaptation by agents implies that purely static detectors will lag behind evolving model capabilities.
- Evaluations of human-AI writing collaboration will need to incorporate these evasion techniques.
Load-bearing premise
The agentic harness faithfully executes the original preregistered protocol and the new adversarial loop without introducing implementation artifacts or data leakage.
What would settle it
An independent execution of the twenty-iteration adversarial loop against the same detector that moves none of the five held-out mimics across the boundary and produces no order-of-magnitude margin reduction would falsify the efficiency claim.
Original abstract
Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL 2026 study on personal-style post-editing of LLM drafts -- and add three new ones -- with the human investigator acting only as a reviewer-in-the-loop. We reproduce all seven preregistered hypotheses and recover the paper's headline correlation between perceived self-similarity and embedding-measured self-similarity to three decimal places (r = +0.244, p < 10^{-8}, n = 648). Under a leakage-free held-out protocol, GPT-5.5 and Claude Opus 4.7 close 71–75% of the style gap to the same-author ceiling on 324 paired tasks, against 24% for the human post-edit, and beat the human post-edit on ~80% of tasks. We then frame the same data as an AI-text detection arms race. A leave-authors-out linear SVM on LUAR-MUD embeddings reaches AUC 0.93–1.00 across approaches; six diagnostics show that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature. Given T = 20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out test mimics to the human half-space and shrinks every margin by an order of magnitude. With moderate effort against a known detector, a frontier LLM can already efficiently lower its own AI-detection probability. All code, 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript uses an agentic research harness to reproduce and extend a prior ACL 2026 study on personal-style post-editing of LLM drafts. It reproduces all seven preregistered hypotheses and recovers the headline correlation between perceived and embedding-measured self-similarity to three decimal places (r = +0.244, p < 10^{-8}, n = 648). Frontier models (GPT-5.5 and Claude Opus 4.7) close 71–75% of the style gap to the same-author ceiling on 324 paired tasks (vs. 24% for human post-edits) and beat human edits on ~80% of tasks. Framing the data as an AI-text detection arms race, a leave-authors-out linear SVM on LUAR-MUD embeddings achieves AUC 0.93–1.00; six diagnostics indicate GPT-5.5 detection is largely a length confound while Opus detection reflects genuine stylistic signatures. With T = 20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out mimics into the human half-space and shrinks all margins by an order of magnitude. All code, 648 drafts, detectors, diagnostics, and trajectories are released.
Significance. If the central empirical results hold under the claimed leakage-free protocol, the work is significant for two reasons: it demonstrates that modern agentic tools can reproduce complex, multi-experiment NLP studies in hours rather than weeks while recovering quantitative results to high precision, and it provides concrete evidence that frontier LLMs can already efficiently lower their own AI-detection probability with moderate, black-box effort. The open release of the full artifact set (648 mimic drafts, trained detectors, six diagnostics, and 20-iteration trajectories) is a clear strength that enables direct verification and reuse.
major comments (2)
- [Abstract and adversarial-experiments section] The headline claim that the Opus agent with T=20 iterations flips two of five held-out mimics to the human half-space and shrinks margins by an order of magnitude depends on the agentic harness executing a strictly black-box, leakage-free loop that supplies only scalar probabilities or margins from the frozen leave-authors-out LUAR-MUD SVM. The manuscript asserts this protocol but does not include explicit interface specification, pseudocode, or verification that the agent has no access to embeddings, weights, or training-set statistics; any such leakage would turn the reported 71–75% gap closure and ~80% win rate into an artifact rather than evidence of general stylistic evasion.
- [Detector-diagnostics subsection] The six diagnostics are used to conclude that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature, supporting the interpretation of the AUC 0.93–1.00 results. However, the main text does not report per-diagnostic effect sizes, confidence intervals, or ablation tables showing how much variance each diagnostic explains; without these, it is difficult to assess whether the length-confound claim for GPT-5.5 is robust enough to underwrite the broader arms-race conclusions.
minor comments (3)
- [Abstract] The exact p-value for the self-similarity correlation is given only as p < 10^{-8}; reporting the precise value, or a 95% CI for the correlation, would improve precision.
- Ensure LUAR-MUD and all other acronyms are defined at first use in the main text rather than only in the abstract.
- [Abstract] The manuscript states that all 648 drafts and trajectories are released; adding a direct link or DOI in the abstract would make this immediately actionable for readers.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review of our manuscript. We address each major comment below and have revised the manuscript to provide the requested clarifications and quantitative details.
Point-by-point responses
- Referee: [Abstract and adversarial-experiments section] The headline claim that the Opus agent with T=20 iterations flips two of five held-out mimics to the human half-space and shrinks margins by an order of magnitude depends on the agentic harness executing a strictly black-box, leakage-free loop that supplies only scalar probabilities or margins from the frozen leave-authors-out LUAR-MUD SVM. The manuscript asserts this protocol but does not include explicit interface specification, pseudocode, or verification that the agent has no access to embeddings, weights, or training-set statistics; any such leakage would turn the reported 71–75% gap closure and ~80% win rate into an artifact rather than evidence of general stylistic evasion.
Authors: We thank the referee for underscoring the need for explicit verification of the black-box protocol. The released code repository contains the complete agentic harness implementation, which interfaces with the frozen SVM exclusively via scalar margin values and contains no access to embeddings, weights, or training statistics. To make this transparent in the main text without requiring readers to inspect the code, we have added a new paragraph in the adversarial-experiments section that specifies the interface contract and includes pseudocode for the T=20 iteration loop. This addition confirms the leakage-free nature of the setup and supports the reported gap-closure and win-rate results as genuine evidence of stylistic evasion. revision: yes
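One way such an interface contract could be expressed, as a sketch rather than the released implementation: the wrapper below exposes a single scalar and keeps embeddings, weights, and training statistics private. All names are illustrative assumptions.

```python
# Hypothetical black-box wrapper enforcing the scalar-only contract.
class FrozenDetectorOracle:
    def __init__(self, embed_fn, svm):
        self._embed = embed_fn  # text -> LUAR-MUD embedding (never exposed)
        self._svm = svm         # frozen linear SVM (never exposed)

    def margin(self, text):
        """The agent may call this and sees only the returned float."""
        return float(self._svm.decision_function([self._embed(text)])[0])
```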
- Referee: [Detector-diagnostics subsection] The six diagnostics are used to conclude that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature, supporting the interpretation of the AUC 0.93–1.00 results. However, the main text does not report per-diagnostic effect sizes, confidence intervals, or ablation tables showing how much variance each diagnostic explains; without these, it is difficult to assess whether the length-confound claim for GPT-5.5 is robust enough to underwrite the broader arms-race conclusions.
Authors: We agree that quantitative reporting of the diagnostics would strengthen the interpretation. In the revised manuscript we have expanded the Detector-diagnostics subsection with a new table that reports per-diagnostic effect sizes (R² and partial correlations), 95% confidence intervals, and an ablation analysis quantifying the incremental variance explained by each diagnostic. These values, computed from the already-released diagnostic scripts, confirm that length accounts for the large majority of explained variance in GPT-5.5 detection while stylistic features dominate for Opus. The text has been updated to reference the table and its implications for the arms-race framing. revision: yes
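A hedged sketch of one such diagnostic, assuming per-draft detector margins and token lengths are available; the helper names are illustrative, not the paper's released diagnostic scripts. It reports how much margin variance length alone explains, and whether a stylistic feature still correlates with the margin once length is partialled out.

```python
# Hypothetical length-confound diagnostic: R^2 from length alone, plus a
# partial correlation of margin with a stylistic feature, length removed.
import numpy as np

def _residualize(v, lengths):
    # Regress v on an intercept and length; return the residuals.
    X = np.column_stack([np.ones_like(lengths), lengths])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ coef

def length_r2(margins, lengths):
    """Fraction of margin variance explained by a linear length term."""
    margins = np.asarray(margins, float)
    lengths = np.asarray(lengths, float)
    resid = _residualize(margins, lengths)
    return 1.0 - resid.var() / margins.var()

def partial_corr(margins, feature, lengths):
    """Margin-feature correlation after regressing length out of both;
    a value near zero is consistent with a pure length confound."""
    margins = np.asarray(margins, float)
    feature = np.asarray(feature, float)
    lengths = np.asarray(lengths, float)
    a = _residualize(margins, lengths)
    b = _residualize(feature, lengths)
    return float(np.corrcoef(a, b)[0, 1])
```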
Circularity Check
No circularity: empirical reproduction with fully released artifacts
Full rationale
The paper performs an empirical reproduction of a prior ACL study on style post-editing, using an agentic harness to rerun all experiments and add new ones. It reports recovering the original correlation (r=+0.244) to three decimal places on n=648 tasks, measures attack success against a frozen SVM detector, and explicitly states that all 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released. No load-bearing mathematical derivation, fitted parameter renamed as prediction, or self-citation chain appears; every central claim is tied to externally verifiable data and code rather than reducing to the paper's own inputs by construction.
Forward citations
Cited by 1 Pith paper
- Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs
  A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.
Reference graph
Works this paper leans on
- [1] Bakdash, J.Z., Marusich, L.R.: Repeated measures correlation. Frontiers in Psychology 8, 456 (2017). https://doi.org/10.3389/fpsyg.2017.00456
- [2] Baumler, C., Bao, C., Nghiem, H., Yang, X., Carpuat, M., Daumé III, H.: Can you make it sound like you? Post-editing LLM-generated text for personal style. arXiv:2604.24444; to appear at ACL 2026 (2026). https://arxiv.org/abs/2604.24444
- [3] Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57(1), 289–300 (1995). https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
- [4] Friedman, M.: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200), 675–701 (1937). https://doi.org/10.1080/01621459.1937.10503522
- [5] Hedges, L.V.: Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics 6(2), 107–128 (1981). https://doi.org/10.3102/10769986006002107
- [6] Karpathy, A.: autoresearch: AI agents running research on single-GPU nanochat training automatically. GitHub repository (2026). https://github.com/karpathy/autoresearch
- [7] Maier, A., Köstler, H., Heisig, M., Krauß, P., Yang, S.H.: Known operator learning and hybrid machine learning in medical imaging—A review of the past, the present, and the future. Progress in Biomedical Engineering 4(2), 022002 (2022). https://doi.org/10.1088/2516-1091/ac5b13
- [8] Maier, A., Schebesch, F., Syben, C., Würfl, T., Steidl, S., Choi, J.H., Fahrig, R.: Precision learning: Towards use of known operators in neural networks. In: Proceedings of the 24th International Conference on Pattern Recognition (ICPR), pp. 183–188. IEEE (2018). https://doi.org/10.1109/ICPR.2018.8545553
- [9] Maier, A., Syben, C., Lasser, T., Riess, C.: A gentle introduction to deep learning in medical image processing. Zeitschrift für Medizinische Physik 29(2), 86–101 (2019). https://doi.org/10.1016/j.zemedi.2018.12.003
- [10] Rivera-Soto, R.A., Miano, O.E., Ordonez, J., Chen, B.Y., Khan, A., Bishop, M., Andrews, N.: Learning universal authorship representations. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 913–919 (2021). https://aclanthology.org/2021.emnlp-main.70/
- [11] Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945). https://doi.org/10.2307/3001968
- [12] Zaiss, M., Aly, A., Endres, J., Dornstetter, T., Weinmüller, S., Maier, A.: Agentic MR sequence development: leveraging LLMs with MR skills for automatic physics-informed sequence development. arXiv:2604.13282 (2026). https://arxiv.org/abs/2604.13282