MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization

Songze Li; Xia Hu; Yongtong Gu

arxiv: 2601.08564 · v2 · submitted 2026-01-13 · 💻 cs.CR

Yongtong Gu, Songze Li, Xia Hu

Pith reviewed 2026-05-16 15:06 UTC · model grok-4.3

classification 💻 cs.CR

keywords AI-generated text detection · evasion attack · black-box attack · style transfer · text humanization · adversarial robustness · large language models

The pith

MASH evades black-box AI-generated text detectors by aligning the style distribution of AI-generated text with that of human writing through sequential fine-tuning stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MASH as a practical way to make AI-generated text evade black-box detectors by reshaping its style to match human text. It applies style-injection supervised fine-tuning first, then direct preference optimization, and finally inference-time refinement to shift output distributions closer to human ones. Experiments on six datasets and five detectors show this reaches 92 percent average attack success rate, beating the best prior methods by 24 percent on average, while producing higher-quality language than alternatives. A reader would care because it demonstrates that current detectors remain vulnerable in realistic settings where attackers have only query access and no internal model details.

Core claim

MASH is a multi-stage framework that evades black-box AIGT detectors by using style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to make AI-generated text distributions resemble those of human-written texts, achieving an average 92 percent attack success rate across six datasets and five detectors while preserving superior linguistic quality compared to eleven baseline evaders.

What carries the argument

Multi-stage Alignment for Style Humanization (MASH), a sequential pipeline of style-injection supervised fine-tuning followed by direct preference optimization and inference-time refinement that performs style transfer to match human text distributions under black-box access.
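The middle stage of the pipeline can be illustrated with the standard direct preference optimization objective for a single preference pair. This is a minimal sketch, not the authors' implementation: this page does not give the paper's exact loss, the function name `dpo_loss` and the default `beta` are hypothetical, and treating the "chosen" text as the rewrite a detector scores as more human-like is an assumption based on the referee report's remark that preference pairs are built from detector queries.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under a frozen reference model (ref_logp_*).
    Assumption: 'chosen' is the more human-like rewrite, 'rejected' the
    more AI-like one. beta=0.1 is an illustrative default only.
    """
    # Scaled difference of log-ratios between chosen and rejected texts.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch on the
    # sign of the margin so exp() never receives a large positive value.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both texts the margin is zero and the loss is log 2; the loss falls as the policy raises the likelihood of the chosen text relative to the rejected one.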

If this is right

Black-box AI text detectors are unreliable against practical style-based evasion attacks that require only output access.
Sequential alignment stages can shift AI text distributions toward human ones while keeping linguistic quality higher than prior attack methods.
High attack success rates hold across multiple datasets and detector types, indicating broad applicability of the style humanization approach.
Adversarial evasion can be achieved with lower computational and interaction costs than white-box or high-query methods assumed in earlier work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detector designers should test robustness against outputs that have undergone multi-stage style alignment rather than raw AI generation.
Future detection systems may need to move beyond surface-level style cues toward deeper structural or semantic invariants that survive humanization.
Widespread use of such methods could make reliable distinction between AI and human text harder in settings like content moderation or academic integrity checks.
Evaluating MASH against detectors that have been fine-tuned on previously humanized examples would clarify whether the evasion remains stable over time.

Load-bearing premise

That applying these sequential style adjustments will consistently make AI text match human distributions without creating fresh patterns that detectors can still identify.

What would settle it

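Retraining a detector on a large set of MASH-generated examples would settle it. If the retrained detector drives the attack success rate substantially lower, the method is introducing new detectable signals; if the attack success rate stays high, the method is shifting style toward human text without leaving such signals.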

Original abstract

The increasing misuse of AI-generated texts (AIGT) has motivated the rapid development of AIGT detection methods. However, the reliability of these detectors remains fragile against adversarial evasions. Existing attack strategies often rely on white-box assumptions or demand prohibitively high computational and interaction costs, rendering them ineffective under practical black-box scenarios. In this paper, we propose Multi-stage Alignment for Style Humanization (MASH), a novel framework that evades black-box detectors based on style transfer. MASH sequentially employs style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shape the distributions of AI-generated texts to resemble those of human-written texts. Experiments across 6 datasets and 5 detectors demonstrate the superior performance of MASH over 11 baseline evaders. Specifically, MASH achieves an average Attack Success Rate (ASR) of 92%, surpassing the strongest baselines by an average of 24%, while maintaining superior linguistic quality.
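The headline metric can be made concrete with a short sketch. This page does not define Attack Success Rate or say how the average is taken, so the definition below (the fraction of attacked AI texts that a detector no longer labels as AI) and the unweighted mean over dataset and detector cells are assumptions, and the function names are hypothetical.

```python
def attack_success_rate(flagged_as_ai):
    """Fraction of attacked texts the detector fails to flag as AI.

    flagged_as_ai: list of bools, one per attacked AI-generated text,
    True when the detector still labels the text as AI-generated.
    """
    if not flagged_as_ai:
        raise ValueError("need at least one attacked text")
    evaded = sum(1 for flagged in flagged_as_ai if not flagged)
    return evaded / len(flagged_as_ai)


def average_asr(results):
    """Unweighted mean ASR over (dataset, detector) cells.

    results: dict mapping (dataset_name, detector_name) to the list of
    per-text detector flags for that cell.
    """
    if not results:
        raise ValueError("need at least one (dataset, detector) cell")
    rates = [attack_success_rate(flags) for flags in results.values()]
    return sum(rates) / len(rates)
```

Under this reading, a setup with 6 datasets and 5 detectors has 30 cells. Whether the paper weights cells by size or averages per detector first is not stated here.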

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MASH gets 92% ASR on its five tested detectors with a staged SFT-DPO-refinement pipeline, but the results look tied to those specific models rather than proving general black-box evasion.

The main point of this paper is that MASH combines style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shift AI text distributions. It reports a 92% average attack success rate across six datasets, beating the strongest of eleven baselines by 24% while keeping linguistic quality higher. That specific sequence, applied to black-box style humanization, is the concrete new piece the authors put forward, and the multi-dataset testing plus quality metrics give it some empirical grounding beyond a single setting. The numbers on the five detectors they used are the clearest evidence they provide.

The main limitation is that all the reported success stays inside those same five detectors. Nothing in the abstract or the stress-test details shows results on held-out detectors or arbitrary black-box systems, so it is possible the optimization is picking up on quirks of exactly those decision boundaries instead of producing a detector-agnostic human-like distribution. If the preference pairs were built by querying the same models, that would make the high ASR less surprising but also less general.

This work is aimed at people building or attacking AI-text detectors in security and NLP settings. It has enough structure and reported scale to merit a serious referee who can check the full experimental protocol and baseline implementations, and ask for cross-detector tests. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multi-stage Alignment for Style Humanization (MASH), a framework that sequentially applies style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shift AI-generated text distributions toward human-written style, thereby evading black-box AIGT detectors. Experiments on 6 datasets and 5 detectors report an average attack success rate of 92%, surpassing the strongest of 11 baselines by 24% on average while preserving linguistic quality.

Significance. If the results hold under broader validation, MASH would provide a concrete, low-interaction-cost attack demonstrating that current black-box detectors remain vulnerable to style-based humanization, motivating the development of detectors robust to distribution shifts induced by preference optimization. The multi-stage pipeline offers a reusable template for future adversarial evaluations in this area.

major comments (2)

[Experiments (Section 4)] The evaluation is confined to 5 fixed detectors with no held-out or cross-detector results reported. Because DPO preference pairs are constructed via detector queries, the 92% ASR may reflect overfitting to the specific decision boundaries of those detectors rather than a detector-agnostic humanization effect; a concrete test would require reporting ASR on at least two additional unseen detectors.
[Results and Analysis (Section 5)] No ablation isolating the contribution of each stage (style-injection SFT, DPO, inference refinement) or statistical tests (e.g., significance of the 24% margin over baselines) appear; without these, the claim that the full pipeline is required for the reported ASR cannot be assessed.
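The held-out check requested in the first major comment could be summarized as below. This is a sketch under stated assumptions: the helper name `generalization_gap` is hypothetical, and "seen" means any detector queried while building preference pairs, which is the report's reading of the method rather than a detail confirmed on this page.

```python
from statistics import mean


def generalization_gap(asr_by_detector, seen_detectors):
    """Compare mean ASR on seen detectors with mean ASR on held-out ones.

    asr_by_detector: dict mapping detector name to its measured ASR.
    seen_detectors: set of detector names queried during training.
    Returns (seen_mean, held_out_mean, gap), where gap is
    seen_mean - held_out_mean; a large positive gap would suggest
    overfitting to the seen detectors' decision boundaries.
    """
    seen = [v for name, v in asr_by_detector.items()
            if name in seen_detectors]
    held_out = [v for name, v in asr_by_detector.items()
                if name not in seen_detectors]
    if not seen or not held_out:
        raise ValueError("need at least one seen and one held-out detector")
    seen_mean, held_out_mean = mean(seen), mean(held_out)
    return seen_mean, held_out_mean, seen_mean - held_out_mean
```

A gap near zero would support the claim of detector-agnostic humanization; the report asks for at least two held-out detectors so the held-out mean does not rest on a single system.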

minor comments (2)

[Abstract] The abstract states results across '6 datasets and 5 detectors' but does not name them; listing the specific detectors and datasets in the abstract or early in Section 4 would improve reproducibility.
[Related Work / Experiments] Baseline descriptions are referenced only by count (11 baselines); a table summarizing each baseline's core mechanism and ASR would clarify the 24% improvement margin.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the experimental validation and analysis as requested.

Referee: [Experiments (Section 4)] The evaluation is confined to 5 fixed detectors with no held-out or cross-detector results reported. Because DPO preference pairs are constructed via detector queries, the 92% ASR may reflect overfitting to the specific decision boundaries of those detectors rather than a detector-agnostic humanization effect; a concrete test would require reporting ASR on at least two additional unseen detectors.

Authors: We acknowledge this concern about potential overfitting, as the DPO stage does rely on queries to the five detectors used in training. While the core goal of MASH is style humanization to shift toward human-written distributions (which we expect to generalize), we agree that explicit held-out testing is needed to confirm detector-agnostic performance. In the revised manuscript, we will add ASR results on at least two additional unseen black-box detectors and discuss the implications for generalizability.
revision: yes

Referee: [Results and Analysis (Section 5)] No ablation isolating the contribution of each stage (style-injection SFT, DPO, inference refinement) or statistical tests (e.g., significance of the 24% margin over baselines) appear; without these, the claim that the full pipeline is required for the reported ASR cannot be assessed.

Authors: We agree that ablations and statistical tests would better substantiate the necessity of the full multi-stage pipeline. We will add comprehensive ablations isolating the impact of style-injection SFT, DPO, and inference-time refinement on ASR and linguistic quality metrics. Additionally, we will include statistical significance tests (e.g., paired t-tests with p-values) for the reported 24% average improvement over baselines in the revised paper. revision: yes
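The paired t-test mentioned in this response reduces to a simple statistic over per-setting ASR differences. The sketch below computes only the t statistic and degrees of freedom using the standard library; the function name is hypothetical, and pairing by (dataset, detector) setting is an assumption about how such a test would be set up. Converting t to a p-value needs the Student t distribution, which a library routine such as `scipy.stats.ttest_rel` provides.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(asr_method, asr_baseline):
    """Paired t statistic for matched ASR measurements.

    asr_method and asr_baseline are equal-length lists; position i in
    both lists must refer to the same evaluation setting.
    Returns (t, degrees_of_freedom).
    """
    if len(asr_method) != len(asr_baseline):
        raise ValueError("inputs must be paired and of equal length")
    diffs = [m - b for m, b in zip(asr_method, asr_baseline)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired settings")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

The test treats each paired setting as one observation, so the number of pairs, not the number of texts, determines the degrees of freedom.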

Circularity Check

0 steps flagged

No circularity: empirical results only

The paper proposes an empirical attack framework (MASH) consisting of sequential SFT + DPO + inference refinement and reports measured ASR on 6 datasets against 5 fixed detectors. No equations, derivations, or load-bearing self-citations appear in the provided text. Claims are framed as experimental outcomes rather than reductions to fitted parameters or prior self-work. The evaluation is self-contained against the stated benchmarks with no theoretical chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven domain assumption that style humanization can consistently align distributions without new artifacts; no free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Style transfer via fine-tuning and preference optimization can make AI text distributions indistinguishable from human text under black-box detector queries.
This is the core premise enabling the evasion claim.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: MASH sequentially employs style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shape the distributions of AI-generated texts to resemble those of human-written texts.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Experiments across 6 datasets and 5 detectors demonstrate the superior performance of MASH over 11 baseline evaders. Specifically, MASH achieves an average Attack Success Rate (ASR) of 92%.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Base Models Look Human To AI Detectors
cs.CL 2026-05 unverdicted novelty 7.0

Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.