DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

Chenyu Zhu; Derek F. Wong; Hao Zhang; Jinsong Su; Junchao Wu; Longyue Wang; Tianqi Shi; Weihua Luo; Yefeng Liu; Yichao Du

arxiv: 2605.15518 · v2 · pith:4RKB5CXXnew · submitted 2026-05-15 · 💻 cs.CL

Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong

Pith reviewed 2026-05-20 19:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM-generated text detection · multilingual benchmark · real-world AI writing · paraphrasing attacks · perturbation attacks · commercial LLMs · AI content governance · language-specific detectors

The pith

DetectRL-X benchmark reveals strengths and limitations of current LLM text detectors across eight languages and real-world writing scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DetectRL-X as a benchmark that spans eight common commercial languages and six high-risk domains for LLM misuse. It generates texts from four popular commercial LLMs and incorporates authentic operations such as polishing, expanding, condensing, plus multilingual paraphrasing and perturbation attacks. Experiments using this setup test how well existing detectors handle linguistic diversity and typical human modifications. A sympathetic reader cares because reliable detection matters for governing AI content and reducing misuse risks in global contexts. The results map where detectors succeed or break down depending on language, domain, generator, and attack type.

Core claim

By constructing DetectRL-X with texts in eight languages from six domains produced by four commercial LLMs, along with refinement operations and a framework for paraphrasing and perturbation attacks, the experiments demonstrate the varying performance of state-of-the-art detectors and establish the benchmark as a tool for improving multilingual and language-specific detection methods.

What carries the argument

DetectRL-X, the benchmark that combines 8 languages, 6 domains, 4 LLMs, refinement operations, and multilingual paraphrasing/perturbation attack frameworks to stress-test detectors.
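To make the idea of a perturbation attack concrete, here is a minimal sketch of one generic kind of writing noise: swapping adjacent letters. This summary does not describe the paper's actual perturbation operators, so the function name `swap_adjacent_chars`, the `rate` parameter, and the sample sentence are all hypothetical stand-ins.

```python
import random


def swap_adjacent_chars(text: str, rate: float, seed: int = 0) -> str:
    """Randomly swap adjacent letter pairs to imitate typing noise.

    Hypothetical helper: a generic stand-in, not the benchmark's own attack.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        # Only swap letter pairs, so spaces and punctuation stay in place.
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a character moves at most once
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    sample = "Les détecteurs doivent rester fiables."
    print(swap_adjacent_chars(sample, rate=0.2))
```

Because `str.isalpha` is Unicode-aware, the same function runs on any script, but how natural the result looks will differ by language, which is the concern the referee report raises about attack fidelity.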

If this is right

Detector accuracy varies with language, domain, generator, text length, and refinement operations.
Paraphrasing and perturbation attacks can reduce detection reliability in multiple languages.
Current detectors show different strengths and weaknesses when applied to diverse linguistic resources.
The benchmark supports development of stronger language-specific detectors.
Analysis of how attacks and operations influence performance guides future detector improvements.
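The per-dimension breakdowns listed above can be sketched as a simple group-by over prediction records. The record fields (`language`, `attack`, `label`, `pred`) and the toy values are assumptions for illustration only; they are not the benchmark's schema or results.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {slice_value: accuracy} for one metadata field, e.g. 'language'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec[key]] += 1
        correct[rec[key]] += int(rec["label"] == rec["pred"])
    return {value: correct[value] / total[value] for value in total}


# Toy records with made-up values; not results from the paper.
records = [
    {"language": "fr", "attack": "none", "label": 1, "pred": 1},
    {"language": "fr", "attack": "paraphrase", "label": 1, "pred": 0},
    {"language": "zh", "attack": "none", "label": 0, "pred": 0},
    {"language": "zh", "attack": "paraphrase", "label": 1, "pred": 1},
]

for field in ("language", "attack"):
    print(field, accuracy_by(records, field))
```

Slicing one field at a time keeps the report readable; crossing fields (for example language by attack) would need a tuple key and far more samples per cell to stay meaningful.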

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future detectors could incorporate language-specific training data drawn from the benchmark's attack patterns.
The benchmark structure might extend to additional languages or emerging LLMs to track evolving detection challenges.
Policy efforts around AI content could use similar cross-language testing to set detection standards.
Real-world deployment of detectors may require ongoing updates based on new refinement operations not covered here.

Load-bearing premise

The selected languages, domains, LLMs, and specific attack strategies sufficiently represent authentic multilingual and real-world AI-assisted writing.

What would settle it

A detector achieving uniformly high accuracy across all eight languages, all six domains, all four generators, and all attack types in DetectRL-X would indicate that the benchmark does not reveal meaningful limitations.
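The "uniformly high accuracy" condition can be read as a check on the worst slice rather than on the average. The sketch below assumes accuracies have already been computed per slice; the slice keys, the accuracy values, and the 0.95 threshold are arbitrary illustrations, not figures from the paper.

```python
def weakest_slice(acc_by_slice):
    """Return the (slice, accuracy) pair with the lowest accuracy."""
    return min(acc_by_slice.items(), key=lambda item: item[1])


def uniformly_high(acc_by_slice, threshold):
    """True only if every slice reaches the threshold."""
    return weakest_slice(acc_by_slice)[1] >= threshold


# Made-up accuracies keyed by (language, attack); purely illustrative.
toy_acc = {
    ("fr", "none"): 0.98,
    ("fr", "paraphrase"): 0.96,
    ("zh", "none"): 0.97,
    ("zh", "paraphrase"): 0.71,
}

print(weakest_slice(toy_acc))
print(uniformly_high(toy_acc, threshold=0.95))
```

Reporting the weakest slice alongside the pass/fail result shows where a detector breaks down, which a mean over all slices would hide.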

Figures

Figures reproduced from arXiv: 2605.15518 by Chenyu Zhu, Derek F. Wong, Hao Zhang, Jinsong Su, Junchao Wu, Longyue Wang, Tianqi Shi, Weihua Luo, Yefeng Liu, Yichao Du, Zeyu Wu.

**Figure 1.** An overview of the structure of the DetectRL-X. The benchmark comprises 3.46 million samples, making it the largest known multilingual LGT detection dataset. It includes 8 languages, 6 domains, 4 generators, 8 attack scenarios, 4 text-length granularities, and 3 types of refinement operations. The benchmark supports Binary and Ternary classification tasks, and includes 8 evaluation dimensions comparing 12 …

**Figure 2.** In-Distribution Performance of Different Detectors on Different Languages.

**Figure 3.** Generalization Performance of Different De…

**Figure 4.** Multilingual Performance of Detectors Across Training Domains.

**Figure 6.** Text length distribution of DetectRL-X.

**Figure 7.** N-gram distribution of DetectRL-X. … all of which exhibit more constrained lexical patterns. Linguistically, languages like French, German, and Portuguese display high n-gram richness, whereas Chinese is a notable outlier with exceptionally low diversity. Across all these comparisons, however, the distributions for LLM-refined and LLM-generated texts remain remarkably consistent and nearly identical. Re…

**Figure 8.** Readability distribution of DetectRL-X. … across content categories and languages. We observed a distinct hierarchy in readability among the categories, with Novel texts being the most accessible and Academic writing proving to be the least. Furthermore, the analysis by language revealed a substantial gap: both Arabic and Russian achieved unexpectedly high readability scores, whereas German and Portuguese…

**Figure 9.** Lexical diversity distribution of DetectRL-X.

Original abstract

The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better aligned with real-world applications, We create LLM-generated texts using 4 popular commercial LLMs, and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DetectRL-X gives a practical multilingual benchmark with real domains and commercial generators, but the attack framework lacks clear per-language validation so results on limitations may mix linguistic effects with simulation artifacts.

The main point for you is that this paper ships a new dataset covering eight languages, six misuse-prone domains, four commercial LLMs, and a set of refinement operations plus paraphrasing and perturbation attacks. That combination is the actual addition; prior English-only benchmarks did not pull these pieces together at this scale. The experiments then run existing detectors across the collection and break down performance by language, domain, generator, attack type, length, and refinement. Those breakdowns are the useful output and show clear variation that anyone building detectors would want to see. The authors also release the data, which matters for follow-up work.

On the positive side, the choice of commercial models and the inclusion of polishing, expanding, and condensing move the test closer to how people actually use LLMs than pure zero-shot generation does. The domain selection also targets areas where misuse is plausible rather than generic text. That part is straightforward and worth having.

The softer spot is the multilingual attack framework. The claim that it reveals genuine strengths and limitations across languages rests on the attacks producing comparably natural output in every language. The paper does not report native-speaker naturalness ratings, linguistic-feature preservation checks, or side-by-side comparison with real human edits broken down by language. If the perturbations are less fluent or more repetitive in the lower-resource languages among the eight, then some of the reported performance drops could trace to attack quality rather than to the detectors themselves. That does not invalidate the whole benchmark, but it does mean the language-specific conclusions need extra caution until that validation is added.

The work is aimed at groups that evaluate or improve detection systems for non-English settings. A reader who needs a ready-made stress-test collection will find concrete numbers and failure modes to build on. It is coherent on its own terms and shows honest engagement with the practical problem, so it deserves a serious referee even if the attack validation section needs strengthening before publication.

Referee Report

1 major / 1 minor

Summary. The paper introduces DetectRL-X, a multilingual benchmark for evaluating LLM-generated text detectors. It covers 8 languages, collects human-written texts from 6 domains, generates texts using 4 commercial LLMs, incorporates real-world AI-assisted operations (polishing, expanding, condensing), and includes a multilingual framework for paraphrasing and perturbation attacks to simulate human modifications and writing noise. Experiments analyze how domains, generators, attacks, length, and refinements affect detector performance across languages, claiming to reveal strengths and limitations of current detectors.

Significance. If the benchmark construction and attack fidelity are sound, this work provides a valuable resource for improving multilingual LLM detectors, addressing a clear gap as LLMs see global use. The emphasis on commercial models and typical writing operations strengthens its relevance to real-world scenarios.

major comments (1)

[Multilingual framework for paraphrasing and perturbation attacks] Description of the multilingual framework for paraphrasing and perturbation attacks: The central claim that results reveal genuine strengths/limitations 'across diverse linguistic resources' and that the framework 'captures authentic usage patterns' depends on the attacks producing comparably natural modifications in each of the 8 languages. The manuscript provides no per-language validation evidence (e.g., native-speaker naturalness ratings, linguistic feature preservation metrics, or comparison to real human edits). If fidelity varies (especially in lower-resource languages), observed performance drops may reflect attack artifacts rather than detector limitations, undermining the benchmark's reliability for language-specific conclusions.

minor comments (1)

[Abstract] Grammatical issues in the abstract: 'To better aligned with real-world applications, We create' should be corrected to 'To better align with real-world applications, we create'.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We appreciate the constructive criticism provided, which helps improve the quality and reliability of our benchmark. Below, we address the major comment regarding the multilingual attack framework.

Referee: [Multilingual framework for paraphrasing and perturbation attacks] Description of the multilingual framework for paraphrasing and perturbation attacks: The central claim that results reveal genuine strengths/limitations 'across diverse linguistic resources' and that the framework 'captures authentic usage patterns' depends on the attacks producing comparably natural modifications in each of the 8 languages. The manuscript provides no per-language validation evidence (e.g., native-speaker naturalness ratings, linguistic feature preservation metrics, or comparison to real human edits). If fidelity varies (especially in lower-resource languages), observed performance drops may reflect attack artifacts rather than detector limitations, undermining the benchmark's reliability for language-specific conclusions.

Authors: We acknowledge the importance of validating the fidelity of our paraphrasing and perturbation attacks across the 8 languages to ensure that observed effects are not artifacts. The manuscript currently describes the framework using multilingual models but lacks explicit per-language validation. In the revised version, we will incorporate linguistic feature preservation metrics (such as semantic similarity via multilingual embeddings and syntactic similarity scores) computed for each language. We will also add a discussion on how the attack models were selected based on their established multilingual capabilities. This addition will support our claims about capturing authentic usage patterns and allow for more robust language-specific conclusions. We believe these changes address the concern effectively.

Revision: yes
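As an illustration of what a per-language preservation metric might look like, the sketch below averages a similarity score between original and attacked texts within each language. It is only a shape for the computation: `char_ngrams` is a surface-overlap stand-in and not a semantic embedding, and all function names and the input format are assumptions. A real check would replace `char_ngrams` with vectors from a multilingual sentence encoder.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Stand-in 'embedding': sparse counts of character n-grams (not semantic)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_similarity_by_language(pairs):
    """pairs: iterable of (language, original_text, attacked_text)."""
    scores = defaultdict(list)
    for lang, original, attacked in pairs:
        scores[lang].append(cosine(char_ngrams(original), char_ngrams(attacked)))
    return {lang: sum(vals) / len(vals) for lang, vals in scores.items()}


if __name__ == "__main__":
    # Made-up example pairs; not data from the benchmark.
    toy_pairs = [
        ("de", "Der Detektor bleibt stabil.", "Der Detektor bliebt stabil."),
        ("ru", "Детектор остаётся надёжным.", "Детектор остаётся надёжным."),
    ]
    print(mean_similarity_by_language(toy_pairs))
```

Comparing the resulting per-language means would show whether an attack distorts some languages more than others, which is the confound the referee comment asks the authors to rule out.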

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

The paper introduces DetectRL-X as a new multilingual benchmark through data collection (human texts from 6 domains, generations from 4 LLMs, and paraphrasing/perturbation operations) followed by direct experimental evaluation of existing detectors. No equations, first-principles derivations, or statistical predictions appear in the provided abstract or setup; results are reported as outcomes of running detectors on the constructed dataset rather than any quantity derived from or equivalent to its inputs by construction. Self-citations, if present in the full text, are not load-bearing for any core claim, and the work is self-contained as an empirical resource without reducing to renamed fits or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark construction depends on author-selected languages, domains, and attack methods that are not derived from first principles but chosen to approximate real-world conditions.

axioms (1)

Domain assumption: The 8 selected languages and 6 domains represent high-risk commercial and misuse scenarios. (Stated in the abstract as the basis for data collection.)

pith-pipeline@v0.9.0 · 5781 in / 1034 out tokens · 31272 ms · 2026-05-20T19:47:27.754453+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We develop a multilingual framework for paraphrasing and perturbation attacks... 8 languages, 6 domains, 4 generators, 8 attack strategies, 4 text-length granularities, and 3 types of refinement operations"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
