
arxiv: 2510.15313 · v2 · submitted 2025-10-17 · 💻 cs.CL

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

Pith reviewed 2026-05-18 06:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · classical chinese poetry · tang poetry · evaluation bias · echo chamber effect · prosodic rules · human expert validation · poetry generation

The pith

LLMs rate machine-generated Tang poems that break prosodic rules higher than human experts do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how well current large language models generate classical Chinese Tang poetry and how well they judge such poetry. It introduces a three-step process that checks form with computational metrics, asks the models themselves to score the outputs, and then brings in human experts for comparison. The central finding is that the models consistently give high marks to machine-generated poems that copy common word patterns but ignore traditional rhythm and structure rules. Human judges apply stricter standards and disagree with the models on which poems are better. This gap suggests that models cannot serve as independent judges for tasks that depend on deep cultural knowledge of poetic conventions.

Core claim

Using computational metrics, LLM-as-judge scoring, and human expert validation on poems generated by six state-of-the-art models, the study identifies an echo chamber effect in which LLMs overrate machine-generated Tang poems that mimic statistical patterns yet violate strict prosodic rules, in clear divergence from human expert judgments.

What carries the argument

The three-step evaluation framework that runs computational metrics first, then LLM self-judgment, then human expert validation to expose mismatches in quality assessment.

If this is right

  • LLMs cannot be relied upon as standalone evaluators for classical Chinese poetry or similar rule-bound cultural forms.
  • Hybrid evaluation that includes human oversight is required for trustworthy assessment of generated poetry.
  • Current models need improved mechanisms to enforce prosodic and stylistic constraints during generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern of overrating pattern-matching outputs may appear when LLMs evaluate other traditional art forms with strict formal rules.
  • Training data that rewards surface-level statistical fit over rule adherence likely contributes to the observed bias.
  • Expanding the study to include poems from other dynasties or languages could test whether the echo chamber effect is specific to Tang poetry.

Load-bearing premise

Human expert ratings provide the correct and unbiased standard against which LLM judgments of Tang poetry quality can be measured.

What would settle it

A follow-up test in which human experts rate the same prosody-violating LLM-generated poems as high as, or higher than, the LLM judges rate them.

Figures

Figures reproduced from arXiv: 2510.15313 by Anna-Carolina Haensch, Bolei Ma, Yina Yao.

Figure 1: The basic framework of poetry generation and …
Figure 2: Top 10 keywords for all models, with highest …
Figure 3: Entropy scores across models.

Model Group 1     Model Group 2      Mean Diff.   p-adj   Reject H0   Conclusion
Baichuan          Other Models       > 1.03       <.001   True        Stat. Unique (Class 3)
DeepSeek          Qwen               0.0197       0.936   False       Stat. Indistinguishable
DeepSeek          Gemma              -0.1865      <.001   True        Sig. Different
Mistral           GLM                0.2652       <.001   True        Sig. Different
Mid-Tier Models   High-Tier Models   > 0.43       <.001   True        Sig. Different
Figure 5: Semantic distinction by theme.
[Plot panels: Cultural Association Score (higher = better) per model, by imagery (e.g. Willow).]
Figure 6: Imagery-Emotion Association in cultural as…
Figure 8: LLM-as-a-judge cross evaluation results. The …
Figure 7: Cross-model semantic alignment.

4.2 LLM-as-a-judge Evaluation

Building upon the objective computational foundations established in analysis 1, the second analysis of our evaluation framework shifts to subjective poetry quality assessment through the LLMs themselves. This LLM-as-a-judge evaluation analysis employs the comprehensive 6 × 6 evaluation matrix, where each model serves as both generator and ev…
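The 6 × 6 generator-by-judge matrix described above lends itself to a simple self-preference check. The following is a minimal sketch, not the authors' implementation: it assumes rows index generators and columns index judges, and the function name `self_preference_bias` is hypothetical.

```python
import numpy as np

def self_preference_bias(scores, names):
    """Return, per judge, (score given to own poems) - (mean score given to others).

    Assumption: scores[i][j] is the mean score judge j assigns to poems
    written by generator i, and generators and judges share the same order.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if scores.shape != (n, n) or len(names) != n:
        raise ValueError("expected a square matrix with one name per model")
    bias = {}
    for j, name in enumerate(names):
        self_score = scores[j, j]                 # judge j rating its own output
        others = np.delete(scores[:, j], j)       # judge j rating every other generator
        bias[name] = float(self_score - others.mean())
    return bias
```

A positive value means a judge scores its own poems above the average it gives to other models; a value near zero means no self-preference under this particular measure.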
Figure 9: Bias between self and other-LLM-as-a-judge …
Figure 11: LLM-as-a-judge evaluation scores for each …
Figure 10: Interaction plot of mean scores by Generator …
Figure 12: Poem Generation Flow Chart. The experimental loop generated poems for combinations of all dimension elements, after which outputs were cleansed and validated against structural and content criteria. Failed generations triggered adaptive retries with increased temperature, and all results were stored incrementally in JSONL to ensure reliability.

"""Please extract the clean poem content from the following text.
**Rules**:
- Keep every line of the poem and its original line and stanza breaks
- Remove any prefix…
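The generation loop in the Figure 12 caption (validate, retry with a higher temperature, append to JSONL) can be sketched as follows. This is an illustration under stated assumptions: the `generate` and `validate` callables, the starting temperature, the step size, and the attempt limit are all hypothetical, since the source does not give them.

```python
import json

def generate_with_retries(generate, validate, prompt,
                          base_temperature=0.7, step=0.1, max_attempts=3):
    """Call generate(prompt, temperature) until validate(text) passes.

    Each failed attempt raises the temperature by `step` (assumed values).
    Returns a record dict that is safe to serialize as one JSONL line.
    """
    temperature = base_temperature
    for attempt in range(max_attempts):
        temperature = base_temperature + attempt * step
        text = generate(prompt, temperature)
        if validate(text):
            return {"prompt": prompt, "poem": text, "ok": True,
                    "temperature": round(temperature, 3), "attempts": attempt + 1}
    # Every attempt failed validation; record the failure instead of raising.
    return {"prompt": prompt, "poem": None, "ok": False,
            "temperature": round(temperature, 3), "attempts": max_attempts}

def append_jsonl(path, record):
    """Append one record per line so earlier results survive a crash."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing each record as soon as it is produced is what makes the storage incremental: an interrupted run loses at most the poem in flight.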
Figure 13: Poem Cleaning Flow Chart. For text clean…
Figure 16: Entropy scores across models by imagery.
Figure 17: Entropy scores across models by poet.

… given another variable from a specific subgroup (Subgroup):

$$I(X; Y) = H(Y) - H(Y \mid X) \tag{5}$$

It indicates how much knowing one variable (e.g., X) reduces uncertainty about another variable (e.g., Y). A higher information gain means that knowing one variable removes more of the uncertainty about the other. A significant drop in entropy indicates that a given dimension …
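Equation (5) can be computed directly from paired observations. The sketch below is an illustration, not the paper's code: it assumes Shannon entropy in bits (base 2) over discrete values, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Shannon entropy H (in bits, an assumed base) of a sequence of discrete items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(pairs):
    """I(X; Y) = H(Y) - H(Y | X) for an iterable of (x, y) observations."""
    pairs = list(pairs)
    h_y = entropy(y for _, y in pairs)
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    # H(Y | X) is the subgroup entropy weighted by how often each x occurs.
    h_y_given_x = sum(len(ys) / len(pairs) * entropy(ys) for ys in groups.values())
    return h_y - h_y_given_x
```

For example, with X as a prompt dimension such as form and Y as a token drawn from the generated poem, a gain close to H(Y) means the dimension almost fully determines the output, while a gain near zero means it barely changes the output distribution.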
Figure 14: Entropy scores across models by emotion.
Figure 15: Entropy scores across models by form.

L Additional Results - Information Gain

To quantify how different creative dimensions in the prompts influence the diversity of the generated poems, we analyzed the information gain for each dimension, drawing from Ma et al. (2025). It measures how much information one random variable provides about another. It is calculated as the difference between the entropy of …
Figure 18: Information Gain by Form. Each subplot compares the overall group entropy (…
Figure 19: Information Gain by Theme. Each subplot compares the overall group entropy (…
Figure 20: Information Gain by Poet. Each subplot compares the overall group entropy (…
Figure 21: Information Gain by Emotion. Each subplot compares the overall group entropy (…
Figure 22: Information Gain by Imagery. Each subplot compares the overall group entropy (…
Figure 23: Human evaluation score analysis for DeepSeek.
Figure 24: Human evaluation score analysis for Gemma.
Figure 25: Human evaluation score analysis for Qwen.
Figure 26: Human vs LLM evaluation correlation for DeepSeek generated poems (averaged across 6 LLM judges).
Figure 27: Human vs LLM evaluation correlation for Gemma generated poems (averaged across 6 LLM judges).
Figure 28: Human vs LLM evaluation correlation for Qwen generated poems (averaged across 6 LLM judges).
Figure 29: Human vs LLM evaluation correlation by dimension for DeepSeek generated poems (averaged across 6 …
Figure 30: Human vs LLM evaluation correlation by dimension for Gemma generated poems (averaged across 6 …
Figure 31: Human vs LLM evaluation correlation by dimension for Qwen generated poems (averaged across 6 LLM …
Figure 32: Human-LLM correlation by judge model and dimension for DeepSeek generated poems.
Figure 33: Human-LLM correlation by judge model and dimension for Gemma generated poems.
Figure 34: Human-LLM correlation by judge model and dimension for Qwen generated poems.
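The human-versus-LLM correlations in Figures 26 to 34 compare expert scores with judge scores, in some cases averaged across the six LLM judges. A minimal sketch of that comparison follows. It assumes Pearson correlation, which the source does not confirm, and the function name and array layout are hypothetical.

```python
import numpy as np

def human_llm_correlation(human_scores, llm_scores_by_judge):
    """Correlate human scores with the judge-averaged LLM score per poem.

    human_scores: one score per poem.
    llm_scores_by_judge: shape (n_judges, n_poems), one row per LLM judge.
    Uses Pearson's r as an assumed choice of coefficient.
    """
    human = np.asarray(human_scores, dtype=float)
    llm = np.asarray(llm_scores_by_judge, dtype=float)
    if llm.ndim != 2 or llm.shape[1] != human.shape[0]:
        raise ValueError("expected one LLM score per judge per poem")
    llm_mean = llm.mean(axis=0)  # average across judges for each poem
    return float(np.corrcoef(human, llm_mean)[0, 1])
```

Passing a single judge's row as a one-row matrix gives the per-judge variant; values near zero or below, like several of those listed on this page, indicate that the judge's ranking of poems carries little of the experts' ranking.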
Original abstract

Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style, in the context of Tang poetry generation. Our analysis reveals a critical "echo chamber" effect: LLMs systematically overrate machine-generated poems that mimic statistical patterns yet fail strict prosodic rules, diverging significantly from human expert judgments. These findings underscore the limitations of using LLMs as standalone evaluators for culturally complex tasks, highlighting the necessity of hybrid human-model validation frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a three-step evaluation framework combining computational metrics, LLM-as-a-judge assessment, and human expert validation to evaluate six state-of-the-art LLMs on Tang poetry generation across dimensions including themes, emotions, imagery, form, and style. It reports a critical 'echo chamber' effect in which LLMs systematically overrate machine-generated poems that mimic statistical patterns but violate strict prosodic rules, with these judgments diverging significantly from those of human experts, and concludes that LLMs cannot serve as standalone evaluators for culturally complex creative tasks.

Significance. If the central findings hold after addressing the methodological gaps, the work would be significant for the study of LLM evaluation biases in creative and culturally specific domains. It provides empirical support for the limitations of LLM self-assessment in poetry generation and strengthens the case for hybrid human-model validation frameworks, which is timely given increasing use of LLMs in literary and artistic applications.

major comments (2)
  1. [Abstract and three-step evaluation framework] The description of the three-step evaluation framework provides no quantitative details on how divergence between LLM and human judgments was measured, including sample sizes for generated poems and ratings, statistical tests employed, or controls for poem selection and selection bias. This information is essential to substantiate the 'echo chamber' claim and is absent from the abstract and framework overview.
  2. [Human expert validation component of the framework] The human expert validation step, positioned as the key external ground truth for detecting LLM overrating of prosodically invalid poems, reports no information on the number of experts, their qualifications in Tang poetry prosody and aesthetics, the precise rating protocol, or inter-rater agreement statistics such as Cohen’s or Fleiss’ kappa. Without these, the observed divergence could reflect expert variability rather than model bias, directly undermining the central claim.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly noting the scale of the experiment (e.g., number of poems generated or evaluated) to give readers an immediate sense of the empirical basis.
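Major comment 2 asks for inter-rater agreement statistics such as Fleiss' kappa. A minimal sketch of that statistic is given here for orientation. It assumes every poem is rated by the same number of experts and that ratings are tallied into a poems-by-categories count matrix; the function name is hypothetical and nothing here reflects data from the paper.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][k] is the number of raters
    who assigned item i to category k. Every row must sum to the same
    number of raters (at least 2)."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    if n_raters < 2 or not np.all(counts.sum(axis=1) == n_raters):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Share of all ratings that fall in each category.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    # Observed agreement per item, then averaged over items.
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()
    # Agreement expected by chance; kappa is undefined if this equals 1.
    p_exp = np.sum(p_cat ** 2)
    return float((p_bar - p_exp) / (1.0 - p_exp))
```

Here a row is one poem and a column is one score level, for instance the five points of a 1-5 scale. Kappa treats the levels as unordered categories, so an ordinal-aware statistic may be a better fit for graded scores.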

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and transparency of our methodological descriptions. We address each major comment below and will revise the manuscript to incorporate additional details where needed.

Point-by-point responses
  1. Referee: [Abstract and three-step evaluation framework] The description of the three-step evaluation framework provides no quantitative details on how divergence between LLM and human judgments was measured, including sample sizes for generated poems and ratings, statistical tests employed, or controls for poem selection and selection bias. This information is essential to substantiate the 'echo chamber' claim and is absent from the abstract and framework overview.

    Authors: We agree that the abstract and high-level framework overview would benefit from greater self-containment with quantitative details. While the full manuscript elaborates on these aspects in the Methods and Results sections, we will revise the abstract to include key quantitative elements such as sample sizes for poem generation and human ratings, the statistical tests used to quantify divergence, and controls for selection bias. We will also add a concise summary table or paragraph in the framework overview section to make this information more accessible without requiring readers to consult later sections.

    Revision: yes

  2. Referee: [Human expert validation component of the framework] The human expert validation step, positioned as the key external ground truth for detecting LLM overrating of prosodically invalid poems, reports no information on the number of experts, their qualifications in Tang poetry prosody and aesthetics, the precise rating protocol, or inter-rater agreement statistics such as Cohen’s or Fleiss’ kappa. Without these, the observed divergence could reflect expert variability rather than model bias, directly undermining the central claim.

    Authors: We acknowledge that the description of the human expert validation requires more explicit detail to fully substantiate its role as reliable ground truth. In the revised manuscript, we will expand the relevant section to specify the number of experts, their qualifications and expertise in Tang poetry prosody and aesthetics, the exact rating protocol employed, and inter-rater agreement statistics such as Fleiss’ kappa. These additions will help clarify that the observed divergences are more likely attributable to LLM evaluation biases than to variability among the experts.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation relies on external human validation

full rationale

The paper proposes a three-step framework (computational metrics + LLM-as-judge + human expert validation) and reports divergence between LLM ratings and human judgments on prosodic validity. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The central 'echo chamber' claim is positioned against independent human expert input rather than reducing to LLM self-assessment by construction. Human validation functions as an external benchmark, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are described. The framework implicitly assumes human experts are reliable ground truth and that the selected models represent current LLM capabilities.

axioms (1)
  • Domain assumption: Human expert judgments constitute the reliable ground truth for assessing adherence to Tang poetry prosodic rules and overall quality.
    The paper uses divergence from human experts to identify LLM bias, making this assumption load-bearing for the echo chamber claim.



Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    In The Twelfth International Conference on Learning Representations

    ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations. Yanran Chen, Hannes Gröner, Sina Zarrieß, and Steffen Eger. 2024. Evaluating diversity in automatic poetry generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, ...

  2. [2]

    five-character quatrain

    All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics. DeepSeek-AI. 2024. D...

  3. [3]

    **Prosodic adherence**: Rhyme, tonal pattern, rhyme scheme, rhythm

  4. [4]

    **Thematic relevance**: Clarity and appropriateness of theme

  5. [5]

    **Emotional consistency**: Sincerity and consistency of emotional expression

  6. [6]

    **Imagery structure**: Vividness of imagery and soundness of structure

  7. [7]

    prosodic_adherence

    **Language authenticity**: Classical flavor and accuracy of language
    # Poem Information
    - **Poet**: {poet}
    - **Theme**: {theme}
    - **Emotion**: {emotion}
    - **Imagery**: {imagery}
    - **Form**: {form}
    - **Poem Text**: {poem_text}
    # Output Requirement
    Please output the scoring results directly in JSON format, giving only an integer score from 1 to 5 for each dimension:
    { "prosodic_adherence": score, "thematic_relevance": score, "emotional_consistency": score, "imagery_structure": score, "language_authenticity": score }
    EN: # Role Yo...

  8. [8]

    **prosodic adherence**: Rhyme, tonal pattern, rhyme scheme, rhythm

  9. [9]

    **thematic relevance**: Clarity and appropriateness of theme

  10. [10]

    **emotional consistency**: Sincerity and consistency of emotional expression

  11. [11]

    **imagery structure**: Vividness of imagery and soundness of structure

  12. [12]

    prosodic_adherence

    **language authenticity**: Classical flavor and accuracy of language
    # Poem Information
    - **Poet**: {poet}
    - **Theme**: {theme}
    - **Emotion**: {emotion}
    - **Imagery**: {imagery}
    - **Form**: {form}
    - **Poem Text**: {poem_text}
    # Output Requirement
    Please output the scoring results directly in JSON format, giving an integer 1-5 for each dimension: ...

  13. [13]

    Qwen2.5-7B-Instruct: 0.231

  14. [14]

    gemma-2-9b-it: 0.226

  15. [15]

    Baichuan2-7B-Chat: 0.105

  16. [16]

    DeepSeek-V2-Lite-Chat: -0.021

  17. [17]

    Mistral-7B-Instruct-v0.3: -0.031

  18. [18]

    glm-4-9b-chat-hf: -0.218

  19. [19]

    gemma-2-9b-it: 0.305

  20. [20]

    DeepSeek-V2-Lite-Chat: 0.185

  21. [21]

    Baichuan2-7B-Chat: 0.182

  22. [22]

    glm-4-9b-chat-hf: 0.155

  23. [23]

    Qwen2.5-7B-Instruct: 0.122

  24. [24]

    Mistral-7B-Instruct-v0.3: 0.109

  25. [25]

    DeepSeek-V2-Lite-Chat: 0.108

  26. [26]

    Mistral-7B-Instruct-v0.3: 0.099

  27. [27]

    glm-4-9b-chat-hf: 0.080

  28. [28]

    Baichuan2-7B-Chat: 0.069

  29. [29]

    gemma-2-9b-it: 0.063

  30. [30]

    Qwen2.5-7B-Instruct: -0.088
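The judge prompt quoted in the entries above asks each model for a JSON object with an integer from 1 to 5 on five named dimensions. A minimal sketch of how such a reply might be parsed and validated is shown here. The key names come from the quoted prompt; the function name and the choice to tolerate surrounding chatter are assumptions.

```python
import json

# Dimension keys as they appear in the quoted judge prompt.
DIMENSIONS = (
    "prosodic_adherence",
    "thematic_relevance",
    "emotional_consistency",
    "imagery_structure",
    "language_authenticity",
)

def parse_scores(raw_reply):
    """Extract the outermost {...} span from a reply and validate the scores.

    Raises ValueError if no JSON object is found, if it does not parse,
    or if any dimension is missing or not an integer in 1..5.
    """
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(raw_reply[start:end + 1])  # JSONDecodeError is a ValueError
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {dim}: {value!r}")
        scores[dim] = value
    return scores
```

Rejecting malformed replies rather than coercing them keeps silent parsing errors out of the score matrix; whether to retry or drop such replies is a separate design decision.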