TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

Denny Vrandečić; Elena Simperl; Elizabeth Black; Gerrit Quaremba

arxiv: 2605.31113 · v1 · pith:CW2DEUUYnew · submitted 2026-05-29 · 💻 cs.CL

Pith reviewed 2026-06-28 22:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords: machine-generated text detection · Wikipedia editing · LLM · benchmark · task-specific generation · generalization asymmetry · user-generated content

The pith

SOTA detectors lose 10-40% accuracy on task-specific machine-generated text from real Wikipedia editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing MGT detection benchmarks overestimate performance because they rely on generic prompts rather than the constrained tasks editors actually perform on Wikipedia. This matters for knowledge integrity on user-generated content platforms, where LLMs are used for specific operations like summarization that produce text closer to human writing. TSM-Bench shows both the accuracy drop and a generalization asymmetry: fine-tuning on task-specific data transfers to generic cases across domains, but generic training does not transfer back. The results indicate that generic-trained models overfit to superficial generation artifacts and leave most detectors unreliable for real-world use.

Core claim

A range of SOTA MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia: average detection accuracy drops by 10-40% compared to prior benchmarks. A generalisation asymmetry also exists: fine-tuning on task-specific data enables generalisation to generic data, even across domains, but not vice versa.
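The claim as quoted does not say whether the 10-40% drop is measured in percentage points or relative to the accuracy on prior benchmarks. The sketch below shows how the two readings differ and one possible way to score the transfer asymmetry. The function names, the toy labels, and the definition of the asymmetry gap are illustrative assumptions, not the paper's evaluation code or its results.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the gold labels."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and of equal length")
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def accuracy_drop(acc_generic, acc_task):
    """Return (drop in percentage points, drop relative to acc_generic in percent)."""
    absolute = (acc_generic - acc_task) * 100
    relative = (acc_generic - acc_task) / acc_generic * 100
    return absolute, relative


def asymmetry_gap(acc_task_to_generic, acc_generic_to_task):
    """Hypothetical score: positive when task-specific training transfers
    to generic data better than generic training transfers to task data."""
    return acc_task_to_generic - acc_generic_to_task


# Toy data only (1 = machine-generated, 0 = human-written); not benchmark results.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds_on_generic = [1, 1, 1, 1, 0, 0, 0, 1]
preds_on_task = [1, 0, 0, 1, 0, 0, 1, 0]

acc_generic = accuracy(labels, preds_on_generic)
acc_task = accuracy(labels, preds_on_task)
drop_points, drop_relative = accuracy_drop(acc_generic, acc_task)
print(f"generic={acc_generic:.3f} task={acc_task:.3f}")
print(f"drop: {drop_points:.1f} points, {drop_relative:.1f}% relative")
```

The same pair of accuracies gives a different headline number under each reading, which is why a per-detector table stating the convention helps.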

What carries the argument

TSM-Bench, a multilingual multi-generator multi-task benchmark built from common Wikipedia editing tasks such as summarisation and expansion.

If this is right

Models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation.
Fine-tuning on task-specific data enables generalisation to generic data even across domains.
Most current detectors remain unreliable for automated detection in real-world UGC platforms.
TSM-Bench supplies a foundation for developing and evaluating more reliable future detectors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detector training pipelines for UGC platforms should shift priority toward task-constrained examples rather than broad generic corpora.
The observed asymmetry suggests that evaluation on generic benchmarks alone is insufficient to certify detectors for practical deployment.
Comparable detection gaps may appear on other collaborative platforms that rely on constrained writing tasks.

Load-bearing premise

The task-specific MGT instances constructed for TSM-Bench accurately reflect how Wikipedia editors actually employ LLMs in practice.

What would settle it

Direct observation or logging of LLM-assisted edits on Wikipedia that produces text distributions measurably different from the benchmark's task-specific generations.
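One way such a comparison could be run is a permutation test on a simple text feature. The sketch below is a minimal illustration under stated assumptions: `token_count` (whitespace token count) is a stand-in feature, the two samples are hypothetical, and nothing here comes from the paper or from real edit logs.

```python
import random


def token_count(text):
    """Stand-in feature: number of whitespace-separated tokens."""
    return len(text.split())


def mean(values):
    return sum(values) / len(values)


def permutation_test(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns an estimated p-value for the null hypothesis that both
    samples come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical feature values for logged edits and benchmark generations.
logged_edit_lengths = list(range(20))
benchmark_lengths = list(range(100, 120))
p_value = permutation_test(logged_edit_lengths, benchmark_lengths)
print(f"p-value: {p_value:.4f}")
```

A small p-value on several features would suggest the benchmark's generations differ measurably from logged edits; a single feature such as length is only a starting point.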

Figures

Figures reproduced from arXiv: 2605.31113 by Denny Vrandečić, Elena Simperl, Elizabeth Black, Gerrit Quaremba.

Figure 2: Overview of TSM-BENCH: (1) We define four editing tasks informed by research on how editors employ LLMs. (2) For each task, we adopt two prompts from the natural language generation literature and automatically evaluate them against a simple baseline. (3) Using the highest-scoring prompt, we generate MGT from six LMs. (4) Finally, we run five experiments on these data and draw key conclusions about the effecti…

Figure 3: Comparison of off-the-shelf detectors on generic and task-specific MGT.

Figure 4: Out-of-domain accuracies of mDeBERTa by language with GPT-4o. Our dataset balances…

Figure 5: SHAP features for mDeBERTa. Fine-tuning on generic data tends to overfit to surface-level features. To analyse the results of Experiments 1-3, we compare features learned by mDeBERTa when trained on generic versus task-specific English Wikipedia data…

Figure 6: Cross-task accuracies of mDeBERTa by language with GPT-4o. IP = Introductory Para…

Figure 7: Comparison of textual characteristics between human, generic, and our task-specific MGT…

Figure 8: Comparison of textual characteristics between human, generic, and our task-specific MGT…

Figure 9: Precision-recall curves across tasks and languages (with GPT-4o as the generator).

Figure 10: Out-of-domain accuracies of mDeBERTa by language with Qwen 2.5. Our dataset bal…

Figure 11: Cross-task domain accuracies of mDeBERTa by language with Qwen 2.5.

Original abstract

Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on generic text generation tasks (e.g., "Write an article about machine learning."). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These task-specific MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of SOTA MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce TSM-Bench, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (i) average detection accuracy drops by 10-40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data, even across domains, but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. TSM-Bench therefore provides a critical foundation for developing and evaluating future models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TSM-Bench documents real performance drops on Wikipedia task-specific text but rests on an unvalidated claim that its generations match actual editor LLM use.

Detectors lose substantial accuracy on text produced for concrete Wikipedia tasks such as summarization or expansion, and fine-tuning on those tasks transfers better than the reverse. That asymmetry and the 10-40% drop are the concrete results worth noting.

The paper's main contribution is TSM-Bench itself: a multilingual, multi-generator set of examples built around editing workflows rather than open-ended generation. It is a clear step beyond the generic benchmarks cited in the abstract, and the reported generalization pattern is worth testing further because it points to overfitting on superficial generation cues.

The soft spot is the missing link between the benchmark tasks and real Wikipedia practice. The abstract states the tasks are "common, real-world" but supplies no edit-log statistics, editor surveys, or prompt-distribution checks to support that. If the conditioning or length constraints differ from how editors actually prompt models, the accuracy gap and asymmetry could be specific to this construction rather than a general property of task-specific MGT.

This is a paper for people working on detection for user-generated content platforms. Anyone evaluating detectors for deployment on Wikipedia-scale data would want to see the benchmark and the asymmetry result. The work shows clear thinking on the evaluation setup and does not rely on circular self-citation.

I would send it to peer review. The empirical claims are falsifiable and the benchmark could be adopted if the realism question is addressed.

Referee Report

2 major / 2 minor

Summary. The paper introduces TSM-Bench, a multilingual, multi-generator, multi-task benchmark for MGT detection focused on common Wikipedia editing tasks such as summarization and expansion. It claims that SOTA detectors exhibit 10-40% lower accuracy on these task-specific instances than on prior generic benchmarks, demonstrates a generalization asymmetry (fine-tuning on task-specific data transfers to generic data across domains, but not vice versa), and concludes that existing detectors remain unreliable for real-world UGC platforms like Wikipedia.

Significance. If the constructed tasks accurately mirror real-world LLM-assisted editing workflows, the reported accuracy drops and asymmetry would provide concrete evidence that current detectors overfit to generic generation artifacts and would establish TSM-Bench as a necessary resource for developing detectors suitable for UGC integrity. The work also supplies an empirical basis for preferring task-specific fine-tuning in detector training.

major comments (2)

[Introduction / Benchmark Construction] The central claim that TSM-Bench reflects 'real-world Wikipedia editing practices', and that the observed 10-40% accuracy drop therefore indicates unreliability 'in real-world contexts such as UGC platforms', is load-bearing. Yet the manuscript supplies no edit-log analysis, editor survey, or distributional comparison between the chosen tasks/prompts and actual editor LLM usage to ground this premise.
[Results / Generalization Experiments] The reported asymmetry (task-specific fine-tuning generalizes to generic data but not vice versa) and the claim that generic-only models 'overfit to superficial artefacts' are presented as key findings. However, the manuscript does not report the statistical tests, confidence intervals, or ablation controls used to establish that the asymmetry is not an artifact of the particular task set or data splits.

minor comments (2)

[Abstract / Results Tables] The abstract states quantitative drops without accompanying error bars or per-detector breakdowns; adding these to the main results tables would improve interpretability.
[Benchmark Description] Notation for the multi-task categories (e.g., summarization vs. expansion) should be defined once with explicit examples of the Wikipedia-specific conditioning used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [Introduction / Benchmark Construction] The central claim that TSM-Bench reflects 'real-world Wikipedia editing practices' and that the observed 10-40% accuracy drop therefore indicates unreliability 'in real-world contexts such as UGC platforms' is load-bearing, yet the manuscript supplies no edit-log analysis, editor survey, or distributional comparison between the chosen tasks/prompts and actual editor LLM usage to ground this premise.

Authors: We agree that stronger empirical grounding would be valuable. The tasks were chosen based on common Wikipedia editing activities described in prior literature on collaborative editing workflows. We did not perform a new edit-log analysis or editor survey. In revision we will add a subsection detailing the task selection rationale with supporting citations, moderate phrasing from 'real-world Wikipedia editing practices' to 'common task-specific Wikipedia editing tasks', and note the absence of direct usage statistics as a limitation. A full distributional study lies outside the current scope.

Revision: partial

Referee: [Results / Generalization Experiments] The reported asymmetry (task-specific fine-tuning generalizes to generic data but not vice versa) and the claim that generic-only models 'overfit to superficial artefacts' are presented as key findings, but the manuscript does not report the precise statistical tests, confidence intervals, or ablation controls used to establish that the asymmetry is not an artifact of the particular task set or data splits.

Authors: We appreciate this observation. The asymmetry was consistent across multiple random seeds and data partitions, but we did not report formal tests or intervals. In the revised version we will add confidence intervals for all accuracy figures, paired statistical tests (e.g., McNemar or t-tests) with p-values, and ablation results across alternative splits and task subsets to confirm robustness.

Revision: yes
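The statistics promised in the second response can be sketched with the standard library alone. Below is a minimal illustration of a Wilson confidence interval for an accuracy figure and an exact two-sided McNemar test for two detectors scored on the same examples. The function names, the choice of the Wilson interval, and the counts are assumptions for illustration; the rebuttal does not say which interval or test variant the authors would use.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy of correct/total (z=1.96 ~ 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant counts.

    b: examples detector A gets right and detector B gets wrong.
    c: examples detector A gets wrong and detector B gets right.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of seeing k or fewer, doubled.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, not results from the paper.
low, high = wilson_interval(correct=50, total=100)
p_value = mcnemar_exact(b=10, c=0)
print(f"95% CI: [{low:.3f}, {high:.3f}]  McNemar p = {p_value:.6f}")
```

McNemar's test fits this setting because both detectors are evaluated on the same items, so only the examples on which they disagree carry information about the difference.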

Circularity Check

0 steps flagged

No circularity; empirical benchmark results independent of inputs

The paper introduces TSM-Bench as a new multi-task dataset of LLM generations for Wikipedia editing tasks and reports detector performance metrics on it. These are direct empirical measurements (accuracy drops, generalization tests) obtained by running existing detectors on the constructed data; no derivation reduces a claimed result to a fitted parameter, self-defined quantity, or self-citation chain. The central claims rest on observable performance numbers rather than any equation or premise that is true by construction from the benchmark itself. The representativeness assumption is a validity concern but does not create circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no free parameters, axioms, or invented entities are mentioned or required by the central claims.

pith-pipeline@v0.9.1-grok · 5810 in / 1128 out tokens · 25952 ms · 2026-06-28T22:50:05.859890+00:00 · methodology
