Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
Pith reviewed 2026-05-08 19:26 UTC · model grok-4.3
The pith
A frontier LLM agent can lower its AI-detection probability with moderate effort against a known detector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After confirming the original study's correlation of r = +0.244 on 648 paired drafts, the agentic protocol shows GPT-5.5 and Claude Opus closing 71–75 percent of the style gap to the same-author ceiling on 324 held-out tasks and beating the human post-edit on about 80 percent of them. When the outputs are scored by a leave-authors-out linear SVM on LUAR-MUD embeddings, detection reaches AUC 0.93–1.00. Six diagnostics separate length confounds from genuine stylistic signals. Given twenty feedback iterations against the frozen detector, the Opus agent flips two of five held-out test mimics into the human half-space and shrinks every detection margin by an order of magnitude.
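A minimal sketch of what such a leave-authors-out detection protocol could look like, assuming precomputed LUAR-MUD embeddings; the function names and data layout below are illustrative, not the paper's released code.

```python
# Hypothetical sketch: leave-authors-out AUC for a linear SVM detector.
# Assumes X holds precomputed LUAR-MUD embeddings (one row per draft).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_authors_out_auc(X, y, authors):
    """X: (n, d) embeddings; y: 1 = AI mimic, 0 = human; authors: an
    author id per draft, so no author spans both train and test folds."""
    X, y, authors = np.asarray(X), np.asarray(y), np.asarray(authors)
    aucs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, authors):
        clf = LinearSVC(C=1.0).fit(X[train_idx], y[train_idx])
        # The signed distance to the hyperplane doubles as a detection score.
        scores = clf.decision_function(X[test_idx])
        if len(np.unique(y[test_idx])) == 2:  # AUC needs both classes present
            aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```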
What carries the argument
The iterative adversarial feedback loop that supplies the current detection score and margin from the fixed SVM to the LLM and requests revised mimic drafts for up to twenty rounds.
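A minimal sketch of that loop under stated assumptions: `score_margin` wraps the frozen SVM and returns only a scalar, and `llm_revise` stands in for a call to the generating model. Both names are hypothetical, not the released harness.

```python
# Hypothetical sketch of the twenty-round adversarial feedback loop.
# score_margin(text) -> float: signed margin from the frozen detector;
#                              negative means the human half-space.
# llm_revise(text, margin) -> str: the model's revision, given only the scalar.

def adversarial_loop(draft, score_margin, llm_revise, T=20):
    best_draft, best_margin = draft, score_margin(draft)
    for _ in range(T):
        if best_margin < 0:          # already classified as human-written
            break
        candidate = llm_revise(best_draft, best_margin)
        margin = score_margin(candidate)
        if margin < best_margin:     # keep only margin-reducing revisions
            best_draft, best_margin = candidate, margin
    return best_draft, best_margin
```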
If this is right
- All seven preregistered hypotheses from the source study are recovered under the agentic protocol.
- Frontier models already outperform human post-editors in matching personal style on the majority of tasks.
- Style detectors based on these embeddings are vulnerable to iterative self-correction by the generating model.
- Diagnostic checks can isolate length confounds from true stylistic detection signals.
- Full release of code, drafts, detectors, and trajectories enables direct community replication and extension.
Where Pith is reading between the lines
- Static embedding detectors may require continuous retraining on adversarial examples to remain effective.
- The same feedback loop could be applied to test robustness of other fixed-embedding classifiers in NLP.
- Rapid adaptation by agents implies that purely static detectors will lag behind evolving model capabilities.
- Evaluations of human-AI writing collaboration will need to incorporate these evasion techniques.
Load-bearing premise
The agentic harness faithfully executes the original preregistered protocol and the new adversarial loop without introducing implementation artifacts or data leakage.
What would settle it
An independent execution of the twenty-iteration adversarial loop against the same detector that moves none of the five held-out mimics across the boundary and produces no order-of-magnitude margin reduction would falsify the efficiency claim.
Original abstract
Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL 2026 study on personal-style post-editing of LLM drafts -- and add three new ones -- with the human investigator acting only as a reviewer-in-the-loop. We reproduce all seven preregistered hypotheses and recover the paper's headline correlation between perceived self-similarity and embedding-measured self-similarity to three decimal places (r = +0.244, p < 10^{-8}, n = 648). Under a leakage-free held-out protocol, GPT-5.5 and Claude Opus 4.7 close 71–75% of the style gap to the same-author ceiling on 324 paired tasks, against 24% for the human post-edit, and beat the human post-edit on ~80% of tasks. We then frame the same data as an AI-text detection arms race. A leave-authors-out linear SVM on LUAR-MUD embeddings reaches AUC 0.93–1.00 across approaches; six diagnostics show that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature. Given T = 20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out test mimics to the human half-space and shrinks every margin by an order of magnitude. With moderate effort against a known detector, a frontier LLM can already efficiently lower its own AI-detection probability. All code, 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript uses an agentic research harness to reproduce and extend a prior ACL 2026 study on personal-style post-editing of LLM drafts. It reproduces all seven preregistered hypotheses and recovers the headline correlation between perceived and embedding-measured self-similarity to three decimal places (r = +0.244, p < 10^{-8}, n = 648). Frontier models (GPT-5.5 and Claude Opus 4.7) close 71–75% of the style gap to the same-author ceiling on 324 paired tasks (vs. 24% for human post-edits) and beat human edits on ~80% of tasks. Framing the data as an AI-text detection arms race, a leave-authors-out linear SVM on LUAR-MUD embeddings achieves AUC 0.93–1.00; six diagnostics indicate GPT-5.5 detection is largely a length confound while Opus detection reflects genuine stylistic signatures. With T = 20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out mimics into the human half-space and shrinks all margins by an order of magnitude. All code, 648 drafts, detectors, diagnostics, and trajectories are released.
Significance. If the central empirical results hold under the claimed leakage-free protocol, the work is significant for two reasons: it demonstrates that modern agentic tools can reproduce complex, multi-experiment NLP studies in hours rather than weeks while recovering quantitative results to high precision, and it provides concrete evidence that frontier LLMs can already efficiently lower their own AI-detection probability with moderate, black-box effort. The open release of the full artifact set (648 mimic drafts, trained detectors, six diagnostics, and 20-iteration trajectories) is a clear strength that enables direct verification and reuse.
major comments (2)
- [Abstract and adversarial-experiments section] The headline claim that the Opus agent with T=20 iterations flips two of five held-out mimics to the human half-space and shrinks margins by an order of magnitude depends on the agentic harness executing a strictly black-box, leakage-free loop that supplies only scalar probabilities or margins from the frozen leave-authors-out LUAR-MUD SVM. The manuscript asserts this protocol but does not include explicit interface specification, pseudocode, or verification that the agent has no access to embeddings, weights, or training-set statistics; any such leakage would turn the reported 71–75% gap closure and ~80% win rate into an artifact rather than evidence of general stylistic evasion.
- [Detector-diagnostics subsection] The six diagnostics are used to conclude that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature, supporting the interpretation of the AUC 0.93–1.00 results. However, the main text does not report per-diagnostic effect sizes, confidence intervals, or ablation tables showing how much variance each diagnostic explains; without these, it is difficult to assess whether the length-confound claim for GPT-5.5 is robust enough to underwrite the broader arms-race conclusions.
minor comments (3)
- [Abstract] The exact p-value for the self-similarity correlation is given only as p < 10^{-8}; reporting the precise value, or a 95% CI for the correlation, would improve precision.
- Ensure LUAR-MUD and all other acronyms are defined at first use in the main text rather than only in the abstract.
- [Abstract] The manuscript states that all 648 drafts and trajectories are released; adding a direct link or DOI in the abstract would make this immediately actionable for readers.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review of our manuscript. We address each major comment below and have revised the manuscript to provide the requested clarifications and quantitative details.
Point-by-point responses
- Referee: [Abstract and adversarial-experiments section] The headline claim that the Opus agent with T=20 iterations flips two of five held-out mimics to the human half-space and shrinks margins by an order of magnitude depends on the agentic harness executing a strictly black-box, leakage-free loop that supplies only scalar probabilities or margins from the frozen leave-authors-out LUAR-MUD SVM. The manuscript asserts this protocol but does not include explicit interface specification, pseudocode, or verification that the agent has no access to embeddings, weights, or training-set statistics; any such leakage would turn the reported 71–75% gap closure and ~80% win rate into an artifact rather than evidence of general stylistic evasion.
Authors: We thank the referee for underscoring the need for explicit verification of the black-box protocol. The released code repository contains the complete agentic harness implementation, which interfaces with the frozen SVM exclusively via scalar margin values and contains no access to embeddings, weights, or training statistics. To make this transparent in the main text without requiring readers to inspect the code, we have added a new paragraph in the adversarial-experiments section that specifies the interface contract and includes pseudocode for the T=20 iteration loop. This addition confirms the leakage-free nature of the setup and supports the reported gap-closure and win-rate results as genuine evidence of stylistic evasion. revision: yes
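One way such an interface contract could be expressed, as a sketch rather than the released implementation: the wrapper below exposes a single scalar and keeps embeddings, weights, and training statistics private. All names are illustrative assumptions.

```python
# Hypothetical black-box wrapper enforcing the scalar-only contract.
class FrozenDetectorOracle:
    def __init__(self, embed_fn, svm):
        self._embed = embed_fn  # text -> LUAR-MUD embedding (never exposed)
        self._svm = svm         # frozen linear SVM (never exposed)

    def margin(self, text):
        """The agent may call this and sees only the returned float."""
        return float(self._svm.decision_function([self._embed(text)])[0])
```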
- Referee: [Detector-diagnostics subsection] The six diagnostics are used to conclude that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature, supporting the interpretation of the AUC 0.93–1.00 results. However, the main text does not report per-diagnostic effect sizes, confidence intervals, or ablation tables showing how much variance each diagnostic explains; without these, it is difficult to assess whether the length-confound claim for GPT-5.5 is robust enough to underwrite the broader arms-race conclusions.
Authors: We agree that quantitative reporting of the diagnostics would strengthen the interpretation. In the revised manuscript we have expanded the Detector-diagnostics subsection with a new table that reports per-diagnostic effect sizes (R² and partial correlations), 95% confidence intervals, and an ablation analysis quantifying the incremental variance explained by each diagnostic. These values, computed from the already-released diagnostic scripts, confirm that length accounts for the large majority of explained variance in GPT-5.5 detection while stylistic features dominate for Opus. The text has been updated to reference the table and its implications for the arms-race framing. revision: yes
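A hedged sketch of one such diagnostic, assuming per-draft detector margins and token lengths are available; the helper names are illustrative, not the paper's released diagnostic scripts. It reports how much margin variance length alone explains, and whether a stylistic feature still correlates with the margin once length is partialled out.

```python
# Hypothetical length-confound diagnostic: R^2 from length alone, plus a
# partial correlation of margin with a stylistic feature, length removed.
import numpy as np

def _residualize(v, lengths):
    # Regress v on an intercept and length; return the residuals.
    X = np.column_stack([np.ones_like(lengths), lengths])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ coef

def length_r2(margins, lengths):
    """Fraction of margin variance explained by a linear length term."""
    margins = np.asarray(margins, float)
    lengths = np.asarray(lengths, float)
    resid = _residualize(margins, lengths)
    return 1.0 - resid.var() / margins.var()

def partial_corr(margins, feature, lengths):
    """Margin-feature correlation after regressing length out of both;
    a value near zero is consistent with a pure length confound."""
    margins = np.asarray(margins, float)
    feature = np.asarray(feature, float)
    lengths = np.asarray(lengths, float)
    a = _residualize(margins, lengths)
    b = _residualize(feature, lengths)
    return float(np.corrcoef(a, b)[0, 1])
```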
Circularity Check
No circularity: empirical reproduction with fully released artifacts
Full rationale
The paper performs an empirical reproduction of a prior ACL study on style post-editing, using an agentic harness to rerun all experiments and add new ones. It reports recovering the original correlation (r=+0.244) to three decimal places on n=648 tasks, measures attack success against a frozen SVM detector, and explicitly states that all 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released. No load-bearing mathematical derivation, fitted parameter renamed as prediction, or self-citation chain appears; every central claim is tied to externally verifiable data and code rather than reducing to the paper's own inputs by construction.
Forward citations
Cited by 1 Pith paper
- Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs
  A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.
Reference graph
Works this paper leans on
- [1] Bakdash, J.Z., Marusich, L.R.: Repeated measures correlation. Frontiers in Psychology 8, 456 (2017). https://doi.org/10.3389/fpsyg.2017.00456
- [2] Baumler, C., Bao, C., Nghiem, H., Yang, X., Carpuat, M., Daumé III, H.: Can you make it sound like you? Post-editing LLM-generated text for personal style. arXiv:2604.24444; to appear at ACL 2026 (2026). https://arxiv.org/abs/2604.24444
- [3] Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57(1), 289–300 (1995). https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
- [4] Friedman, M.: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200), 675–701 (1937). https://doi.org/10.1080/01621459.1937.10503522
- [5] Hedges, L.V.: Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics 6(2), 107–128 (1981). https://doi.org/10.3102/10769986006002107
- [6] Karpathy, A.: autoresearch: AI agents running research on single-GPU nanochat training automatically. GitHub repository (2026). https://github.com/karpathy/autoresearch
- [7] Maier, A., Köstler, H., Heisig, M., Krauß, P., Yang, S.H.: Known operator learning and hybrid machine learning in medical imaging—A review of the past, the present, and the future. Progress in Biomedical Engineering 4(2), 022002 (2022). https://doi.org/10.1088/2516-1091/ac5b13
- [8] Maier, A., Schebesch, F., Syben, C., Würfl, T., Steidl, S., Choi, J.H., Fahrig, R.: Precision learning: Towards use of known operators in neural networks. In: Proceedings of the 24th International Conference on Pattern Recognition (ICPR), pp. 183–188. IEEE (2018). https://doi.org/10.1109/ICPR.2018.8545553
- [9] Maier, A., Syben, C., Lasser, T., Riess, C.: A gentle introduction to deep learning in medical image processing. Zeitschrift für Medizinische Physik 29(2), 86–101 (2019). https://doi.org/10.1016/j.zemedi.2018.12.003
- [10] Rivera-Soto, R.A., Miano, O.E., Ordonez, J., Chen, B.Y., Khan, A., Bishop, M., Andrews, N.: Learning universal authorship representations. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 913–919 (2021). https://aclanthology.org/2021.emnlp-main.70/
- [11] Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945). https://doi.org/10.2307/3001968
- [12] Zaiss, M., Aly, A., Endres, J., Dornstetter, T., Weinmüller, S., Maier, A.: Agentic MR sequence development: leveraging LLMs with MR skills for automatic physics-informed sequence development. arXiv:2604.13282 (2026). https://arxiv.org/abs/2604.13282