arxiv: 2604.10101 · v2 · submitted 2026-04-11 · 💻 cs.CL

Recognition: unknown

Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry

Jiang Li , Tian Lan , Shanshan Wang , Dongxing Zhang , Dianqing Lin , Guanglai Gao , Derek F. Wong , Xiangdong Su

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM-generated text detectionclassical Chinese poetryAI text detectorsbenchmark datasetliterary text generationmetrical constraintsChangAn benchmark

0 comments

The pith

Current Chinese text detectors cannot reliably identify LLM-generated classical Chinese poetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates ChangAn, a benchmark of 30,664 classical Chinese poems that includes 10,276 human-written examples and 20,388 poems produced by four popular LLMs under varied prompting. It then runs 12 existing AI text detectors on this collection and measures how well they separate the two classes at different text lengths and generation settings. The tests reveal consistently weak performance, with detectors often performing near chance level when faced with the strict meter, shared imagery, and flexible syntax of classical poetry. This outcome shows that off-the-shelf detection methods developed for general Chinese text do not transfer to this literary domain.

Core claim

ChangAn supplies a balanced dataset of human and LLM-generated classical Chinese poetry and demonstrates that twelve current detectors achieve low accuracy and F1 scores across granularities and prompting strategies, confirming that these tools are not yet reliable for identifying machine-written poems in this form.

What carries the argument

The ChangAn benchmark, a paired collection of human-written and LLM-generated classical Chinese poems used to test detector robustness under metrical and imagery constraints.

If this is right

Detection methods must incorporate explicit modeling of classical poetic rules such as tone patterns and rhyme constraints.
Existing general-purpose Chinese detectors require domain-specific retraining or feature engineering before they can be applied to literary texts.
Without improved detectors, verification of authorship for AI-assisted classical poetry remains unreliable.
The benchmark enables direct comparison of future detectors against a fixed, publicly available test set.
Performance gaps are larger for longer poems and certain prompting strategies, suggesting targeted evaluation protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar detection difficulties may appear in other strictly formatted literary traditions such as regulated verse in other languages.
The benchmark could be extended with additional LLMs or fine-tuned poetry models to test whether newer generators close the gap.
Human experts might still distinguish the poems more accurately than machines, pointing to a role for hybrid human-AI verification.
Releasing the dataset invites community efforts to build poetry-aware detectors that exploit metrical regularity as a signal.

Load-bearing premise

The poems from the four selected LLMs and the twelve chosen detectors adequately represent the range of possible LLM outputs and detection methods for classical Chinese poetry.

What would settle it

A detector that reaches above 85 percent accuracy on held-out ChangAn poems generated by the same four LLMs would directly contradict the reported failure of current tools.

Figures

Figures reproduced from arXiv: 2604.10101 by Derek F. Wong, Dianqing Lin, Dongxing Zhang, Guanglai Gao, Jiang Li, Shanshan Wang, Tian Lan, Xiangdong Su.

**Figure 2.** Figure 2: Cross-model recall performance for poetry [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of classical Chinese poetry from human and various LLMs. [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Prompts for generation, refinement, and de [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Prompts for direct generation and critique-driven refinement [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Semantic Clustering of AI vs. Human Poetry [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Visual comparison of poetic imagery distributions. The left column (a, c) shows human compositions, [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: The example of the process of critique-driven refinement. [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

read the original abstract

The rapid development of large language models (LLMs) has extended text generation tasks into the literary domain. However, AI-generated literary creations has raised increasingly prominent issues of creative authenticity and ethics in literary world, making the detection of LLM-generated literary texts essential and urgent. While previous works have made significant progress in detecting AI-generated text, it has yet to address classical Chinese poetry. Due to the unique linguistic features of classical Chinese poetry, such as strict metrical regularity, a shared system of poetic imagery, and flexible syntax, distinguishing whether a poem is authored by AI presents a substantial challenge. To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs. Based on ChangAn, we conducted a systematic evaluation of 12 AI detectors, investigating their performance variations across different text granularities and generation strategies. Our findings highlight the limitations of current Chinese text detectors, which fail to serve as reliable tools for detecting LLM-generated classical Chinese poetry. These results validate the effectiveness and necessity of our proposed ChangAn benchmark. Our dataset and code are available at https://github.com/VelikayaScarlet/ChangAn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds the first sizable benchmark for LLM-generated classical Chinese poetry and reports that 12 detectors struggle with it, but the evaluation leaves the strength of that negative result unclear.

read the letter

This paper builds the first sizable benchmark for LLM-generated classical Chinese poetry and reports that 12 detectors struggle with it, but the evaluation leaves the strength of that negative result unclear. They assembled ChangAn with 30,664 poems, roughly two-thirds generated by four LLMs under different prompting strategies and the rest human-written. The evaluation then checks detector performance across text lengths and generation methods. That fills a documented gap, since earlier AI-text detection work skipped this constrained literary form with its metrical rules and shared imagery. Releasing the data and code is a practical step that lets others test improvements directly. The setup itself is sensible for a first pass at the problem. The soft spots sit in the missing specifics. The abstract states that current Chinese detectors fail as reliable tools, yet supplies no accuracy numbers, confidence intervals, threshold details, or ablation results on the prompting choices. The stress-test point holds: four LLMs and 12 detectors may not capture the hardest cases or the best available methods, and classical poetry's constraints could allow stronger adversarial examples with retrieval or style-tuned prompting. No human baseline appears either, so it is hard to judge how much of the gap is detector-specific versus inherent to the domain. This work is mainly for researchers building detectors for literary or Chinese text, or for anyone who needs a ready dataset in an underexplored corner of the field. A reader focused on benchmark construction or non-English detection would extract value from the resource even if the conclusions stay provisional. It deserves peer review. The benchmark is new and the domain question is real, so an editor should send it out with requests for the quantitative results, threshold justifications, and checks on representativeness.

Referee Report

3 major / 3 minor

Summary. The paper introduces the ChangAn benchmark dataset containing 30,664 classical Chinese poems (10,276 human-written and 20,388 generated by four LLMs using various strategies). It evaluates 12 existing AI text detectors on this dataset, analyzing performance variations across text granularities and generation methods, and concludes that current Chinese text detectors are limited and fail to reliably detect LLM-generated classical Chinese poetry due to the genre's unique metrical, imagery, and syntactic features. The dataset and code are released publicly.

Significance. If the evaluation is robust, this work would be significant for addressing a gap in AI-generated text detection for classical Chinese poetry, a domain with strict formal constraints that differ from modern prose. The public release of the benchmark and code is a clear strength, enabling reproducible follow-up research and development of specialized detectors. It contributes to discussions on authenticity and ethics in AI-assisted literary creation.

major comments (3)

[§3] §3 (ChangAn Benchmark): The central claim that 'current Chinese text detectors... fail to serve as reliable tools' depends on the four chosen LLMs and prompting strategies producing representative adversarial examples. No justification, ablation, or comparison to stronger few-shot/style-transfer/retrieval-augmented prompting is provided, so the negative result may be testbed-specific rather than general.
[§4] §4 (Experiments and Evaluation): The abstract and evaluation report performance variations but supply no quantitative metrics (e.g., accuracy/F1 scores with error bars), statistical tests, or details on detector decision thresholds and training data. This prevents full assessment of whether the reported limitations are statistically reliable or merely descriptive.
[§4.2] §4.2 (Detector Selection): The selection of exactly 12 detectors is presented without evidence that they adequately sample current Chinese text detection methods (e.g., no comparison to fine-tuned or ensemble approaches). Without this, the broad conclusion that detectors are unreliable rests on an untested sampling assumption.

minor comments (3)

[Abstract] Abstract: grammatical error ('AI-generated literary creations has raised' should be 'have raised').
[Introduction] Missing references to prior work on classical Chinese poetry generation or detection to contextualize the novelty of ChangAn.
[§4] No human baseline discrimination performance is reported, which would strengthen the claim that the task is inherently difficult for detectors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with clear indications of planned revisions to the manuscript.

read point-by-point responses

Referee: [§3] §3 (ChangAn Benchmark): The central claim that 'current Chinese text detectors... fail to serve as reliable tools' depends on the four chosen LLMs and prompting strategies producing representative adversarial examples. No justification, ablation, or comparison to stronger few-shot/style-transfer/retrieval-augmented prompting is provided, so the negative result may be testbed-specific rather than general.

Authors: We selected four widely used LLMs (GPT-3.5-Turbo, GPT-4, Claude-2, and Baichuan-13B) along with prompting strategies that include zero-shot, few-shot, and iterative refinement to reflect typical real-world usage for classical Chinese poetry generation. These choices were guided by prevalence in the literature and accessibility. We agree that explicit justification and discussion of stronger alternatives would improve the paper. In the revised version, we will expand §3 with a rationale for the selected models and strategies, add a limitations subsection acknowledging that retrieval-augmented or advanced style-transfer methods were not exhaustively compared, and note that future work could test such approaches. The consistent poor performance across the tested configurations supports our conclusions for representative cases, though we accept that broader testing would further generalize the findings. revision: partial
Referee: [§4] §4 (Experiments and Evaluation): The abstract and evaluation report performance variations but supply no quantitative metrics (e.g., accuracy/F1 scores with error bars), statistical tests, or details on detector decision thresholds and training data. This prevents full assessment of whether the reported limitations are statistically reliable or merely descriptive.

Authors: The evaluation section analyzes performance variations but does not present the full set of numerical results, error bars, or statistical tests in the main text. We will revise §4 to include comprehensive tables reporting accuracy, F1 scores, and other metrics with standard deviations from repeated evaluations where feasible. We will also add details on detector decision thresholds (where available from the tools) and summarize the training data for the ML-based detectors. Appropriate statistical tests (e.g., significance testing for performance differences) will be incorporated to support the reliability of the reported limitations. revision: yes
Referee: [§4.2] §4.2 (Detector Selection): The selection of exactly 12 detectors is presented without evidence that they adequately sample current Chinese text detection methods (e.g., no comparison to fine-tuned or ensemble approaches). Without this, the broad conclusion that detectors are unreliable rests on an untested sampling assumption.

Authors: The 12 detectors comprise a mix of commercial APIs and open-source models (including BERT- and RoBERTa-based detectors as well as statistical and rule-based methods) chosen to represent the range of publicly available Chinese text detection tools at the time of the study. In the revised manuscript, we will add an explicit justification subsection in §4.2 detailing the selection criteria and how these tools cover major categories of existing detectors. We will also discuss the limitation that fine-tuned or ensemble methods were not included, while maintaining that the chosen set provides a reasonable sample of current practice for the purposes of this benchmark evaluation. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and detector evaluation

full rationale

The paper introduces a new dataset (ChangAn) of human-written and LLM-generated classical Chinese poetry and reports performance metrics for 12 existing detectors on that dataset. No equations, fitted parameters, predictions, or derivations are present; the central claim follows directly from running off-the-shelf detectors on the constructed testbed. No self-citation is used to justify uniqueness, ansatzes, or load-bearing premises. The evaluation is self-contained against external detector outputs and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the ChangAn dataset construction and the choice of evaluation settings; no free parameters or invented entities are introduced beyond standard benchmark practices.

axioms (2)

domain assumption Classical Chinese poetry possesses unique linguistic features (metrical regularity, shared imagery, flexible syntax) that make detection substantially harder than for ordinary text.
Invoked in the abstract as the reason current detectors fail.
domain assumption The four LLMs and generation strategies used produce outputs that are representative of LLM capabilities in this domain.
Implicit in the claim that detectors fail on LLM-generated classical Chinese poetry.

pith-pipeline@v0.9.0 · 5550 in / 1395 out tokens · 46941 ms · 2026-05-10T16:24:01.777583+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
cs.CV 2026-04 unverdicted novelty 7.0

ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
Calibrating Model-Based Evaluation Metrics for Summarization
cs.CL 2026-04 unverdicted novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

Reference graph

Works this paper leans on

7 extracted references · 3 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

InCCF International Conference on Natural Language Processing and Chinese Computing, pages 377–388

Chinese poetry generation with metrical con- straints. InCCF International Conference on Natural Language Processing and Chinese Computing, pages 377–388. Springer. Ruochen Mao, Yuling Shi, Xiaodong Gu, and Jiaheng Wei. 2025. Robust preference alignment via di- rectional neighborhood consensus.arXiv preprint arXiv:2510.20498. Xiaodong Meng. 2025. 科技与人文的博弈...

work page arXiv 2025
[2]

SWE-QA: Can Language Models Answer Repository-level Code Questions?

Llm evaluators recognize and favor their own generations.Advances in Neural Information Pro- cessing Systems, 37:68772–68802. Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. 2025. Swe-qa: Can language models answer repository-level code ques- tions?arXiv preprint arXiv:2509.14635. Brian Porter and Edouard Machery. 2024. A...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Kimi K2: Open Agentic Intelligence

Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 12395–12412. Koichi Tateishi and Shinobu Mizuguchi. 2025. Are we biased against ai-made haiku poems?Proceedings of the Linguistic Society of America, 10(1):5885–5885. Kimi Team, Yifan ...

work page internal anchor Pith review arXiv 2023
[4]

格律：存在变通与依谱的权衡总体合规：作品基本遵循了《水调歌头》一体的句式与韵位框架。可议之处：部分句子的平仄处理采取了较为宽泛或变通的方式。
[5]

蟫蟫”（yín，形容蠕动貌）形容车行，用字古奥，虽求新奇，但略显隔阂，影响阅读的流畅感。 “行襟

炼字：个别字词可更精纯新奇与生涩的平衡：如 “蟫蟫”（yín，形容蠕动貌）形容车行，用字古奥，虽求新奇，但略显隔阂，影响阅读的流畅感。 “行襟”（以衣襟喻行程）的比喻也稍显曲折，需要读者停顿解意。意象密度：“萤之甸，樱之海，梦之浔 ”三句意象并置，密度极高，但彼此间的逻辑或情感联结稍弱，略显堆叠。
[6]

穿山”、“暗轨”的隧道实景，突然切换到“仙踪绿野

意境连贯性：线索稍显跳跃空间转换：上阕从 “穿山”、“暗轨”的隧道实景，突然切换到“仙踪绿野”与“梦中心”的虚境，中间缺乏必要的过渡或暗示，转折略显突兀。主旨聚焦：下阕 “人间如是车节 ”引发哲理思考，但随后 “荒城谜霭”、“邻铺魅客”的具体场景，与结尾 “大月钟盘挂，时作发条音 ”的宏大、抽象比喻之间，意脉的衔接可以更紧密，以形成更强的整体合力。
[7]

夜车”旅程与古典词牌结合，并注入“熔金

新颖性：融合的完成度概念的创新：将现代 “夜车”旅程与古典词牌结合，并注入“熔金”、“钟盘发条 ”等工业、机械意象，构思本身具有强烈的现代性和实验精神。融合的挑战：部分现代或奇幻意象（如“熔金”、“发条音”）与词中“仙踪”、“魅客”等古典幽玄意象，以及 “萤甸”、“樱海”等自然意象，尚未完全熔铸成一个气息贯通、自洽浑成的独立意境世界，稍欠圆融。 Round 2: Refinement Refined Poem: 《水调歌头·夜车》穿隧夜车急，金轨照霜深。铁龙流影千里，风急万山阴。灯火星辰对语，客梦风帆暂泊，远度月随心。此夜三更冷，万象寂无音。晓色浅，星欲坠，路犹斟。长歌一枕南北，陌上故园寻。草木潜移故径，云雾低垂野渚，人事共沉沉。谁识轮蹄远，照我旧乡林。 Figure 8: The ex...