Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry
Pith reviewed 2026-05-18 06:39 UTC · model grok-4.3
The pith
LLMs rate machine-generated Tang poems that break prosodic rules higher than human experts do.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using computational metrics, LLM-as-judge scoring, and human expert validation on poems generated by six state-of-the-art models, the study identifies an echo chamber effect in which LLMs overrate machine-generated Tang poems that mimic statistical patterns yet violate strict prosodic rules, in clear divergence from human expert judgments.
What carries the argument
A three-step evaluation framework that runs computational metrics first, then LLM-as-a-judge scoring, then human expert validation, so that mismatches in quality assessment become visible.
If this is right
- LLMs cannot be relied upon as standalone evaluators for classical Chinese poetry or similar rule-bound cultural forms.
- Hybrid evaluation that includes human oversight is required for trustworthy assessment of generated poetry.
- Current models need improved mechanisms to enforce prosodic and stylistic constraints during generation.
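As a rough illustration of what a mechanical constraint check could look like, the sketch below tests only the most basic formal properties of regulated Tang verse: four lines (jueju) or eight lines (lüshi), each of five or seven characters. It is an assumption-laden toy, not the paper's method. The names `check_form` and `split_lines` are hypothetical, and tonal (ping-ze) and rhyme checks are left out because they need a pronunciation table.

```python
import re

# Assumption: regulated forms only. Jueju has 4 lines, lüshi has 8;
# every line has 5 or 7 characters. Tonal and rhyme rules are NOT checked.
ALLOWED_LINE_COUNTS = {4, 8}
ALLOWED_LINE_LENGTHS = {5, 7}


def split_lines(poem_text: str) -> list[str]:
    """Split a poem on common Chinese/ASCII punctuation and whitespace."""
    parts = re.split(r"[，。！？；、,.!?;\s]+", poem_text)
    return [p for p in parts if p]


def check_form(poem_text: str) -> list[str]:
    """Return human-readable form violations (empty list if none found)."""
    lines = split_lines(poem_text)
    problems = []
    if len(lines) not in ALLOWED_LINE_COUNTS:
        problems.append(f"expected 4 or 8 lines, found {len(lines)}")
    lengths = {len(line) for line in lines}
    if len(lengths) > 1:
        problems.append(f"lines differ in length: {sorted(lengths)}")
    elif lengths and not lengths <= ALLOWED_LINE_LENGTHS:
        problems.append(f"line length {lengths.pop()} is not 5 or 7")
    return problems
```

A poem that passes this check can still violate tonal and rhyme rules, so an empty result here says nothing about full prosodic validity.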
Where Pith is reading between the lines
- The same pattern of overrating pattern-matching outputs may appear when LLMs evaluate other traditional art forms with strict formal rules.
- Training data that rewards surface-level statistical fit over rule adherence likely contributes to the observed bias.
- Expanding the study to include poems from other dynasties or languages could test whether the echo chamber effect is specific to Tang poetry.
Load-bearing premise
Human expert ratings provide the correct and unbiased standard against which LLM judgments of Tang poetry quality can be measured.
What would settle it
A follow-up test in which human experts score the same prosody-breaking LLM-generated poems as high as, or higher than, the LLM judges do.
Original abstract
Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style, in the context of Tang poetry generation. Our analysis reveals a critical "echo chamber" effect: LLMs systematically overrate machine-generated poems that mimic statistical patterns yet fail strict prosodic rules, diverging significantly from human expert judgments. These findings underscore the limitations of using LLMs as standalone evaluators for culturally complex tasks, highlighting the necessity of hybrid human-model validation frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a three-step evaluation framework combining computational metrics, LLM-as-a-judge assessment, and human expert validation to evaluate six state-of-the-art LLMs on Tang poetry generation across dimensions including themes, emotions, imagery, form, and style. It reports a critical 'echo chamber' effect in which LLMs systematically overrate machine-generated poems that mimic statistical patterns but violate strict prosodic rules, with these judgments diverging significantly from those of human experts, and concludes that LLMs cannot serve as standalone evaluators for culturally complex creative tasks.
Significance. If the central findings hold after addressing the methodological gaps, the work would be significant for the study of LLM evaluation biases in creative and culturally specific domains. It provides empirical support for the limitations of LLM self-assessment in poetry generation and strengthens the case for hybrid human-model validation frameworks, which is timely given increasing use of LLMs in literary and artistic applications.
Major comments (2)
- [Abstract and three-step evaluation framework] The description of the three-step evaluation framework provides no quantitative details on how divergence between LLM and human judgments was measured, including sample sizes for generated poems and ratings, statistical tests employed, or controls for poem selection and selection bias. This information is essential to substantiate the 'echo chamber' claim and is absent from the abstract and framework overview.
- [Human expert validation component of the framework] The human expert validation step, positioned as the key external ground truth for detecting LLM overrating of prosodically invalid poems, reports no information on the number of experts, their qualifications in Tang poetry prosody and aesthetics, the precise rating protocol, or inter-rater agreement statistics such as Cohen’s or Fleiss’ kappa. Without these, the observed divergence could reflect expert variability rather than model bias, directly undermining the central claim.
Minor comments (1)
- [Abstract] The abstract would be strengthened by briefly noting the scale of the experiment (e.g., number of poems generated or evaluated) to give readers an immediate sense of the empirical basis.
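To make the first major comment concrete, here is one way divergence between LLM and human ratings could be quantified. This is a sketch under assumptions: the report does not say which statistic the paper uses, so Spearman rank correlation and a mean score gap are illustrative choices, and every function name below is hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant ratings")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_gap(llm_scores, human_scores):
    """Average of (LLM score - human score); positive means LLM overrating."""
    return mean(l - h for l, h in zip(llm_scores, human_scores))
```

A low or negative `spearman` value together with a positive `mean_gap` on the prosody dimension would match the overrating pattern the paper describes. A significance test and the sample sizes the referee asks for would still be needed.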
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for improving the clarity and transparency of our methodological descriptions. We address each major comment below and will revise the manuscript to incorporate additional details where needed.
Point-by-point responses
- Referee: [Abstract and three-step evaluation framework] The description of the three-step evaluation framework provides no quantitative details on how divergence between LLM and human judgments was measured, including sample sizes for generated poems and ratings, statistical tests employed, or controls for poem selection and selection bias. This information is essential to substantiate the 'echo chamber' claim and is absent from the abstract and framework overview.
  Authors: We agree that the abstract and high-level framework overview would benefit from greater self-containment with quantitative details. While the full manuscript elaborates on these aspects in the Methods and Results sections, we will revise the abstract to include key quantitative elements such as sample sizes for poem generation and human ratings, the statistical tests used to quantify divergence, and controls for selection bias. We will also add a concise summary table or paragraph in the framework overview section to make this information more accessible without requiring readers to consult later sections.
  Revision: yes
- Referee: [Human expert validation component of the framework] The human expert validation step, positioned as the key external ground truth for detecting LLM overrating of prosodically invalid poems, reports no information on the number of experts, their qualifications in Tang poetry prosody and aesthetics, the precise rating protocol, or inter-rater agreement statistics such as Cohen’s or Fleiss’ kappa. Without these, the observed divergence could reflect expert variability rather than model bias, directly undermining the central claim.
  Authors: We acknowledge that the description of the human expert validation requires more explicit detail to fully substantiate its role as reliable ground truth. In the revised manuscript, we will expand the relevant section to specify the number of experts, their qualifications and expertise in Tang poetry prosody and aesthetics, the exact rating protocol employed, and inter-rater agreement statistics such as Fleiss’ kappa. These additions will help clarify that the observed divergences are more likely attributable to LLM evaluation biases than to variability among the experts.
  Revision: yes
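For readers unfamiliar with the agreement statistic named in this exchange, the following is a minimal sketch of Fleiss' kappa for a fixed number of raters per poem. The input layout (one list of expert scores per poem) and the name `fleiss_kappa` are assumptions for illustration. The paper's actual rating data and protocol are not described in this report.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa.

    ratings: list of per-poem lists, each holding one categorical score
    per expert. Every poem must have the same number of ratings.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every poem needs the same number (>= 2) of ratings")

    totals = Counter()      # how often each category was used overall
    agreement_sum = 0.0     # sum of per-poem observed agreement P_i
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    p_bar = agreement_sum / n_items  # mean observed agreement
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1:
        raise ValueError("kappa undefined when all ratings use one category")
    return (p_bar - p_e) / (1 - p_e)
```

Fleiss' kappa treats scores as unordered categories. For ordinal 1-5 ratings, a weighted or ordinal agreement measure may be a better fit, so this choice should be read as illustrative.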
Circularity Check
No significant circularity; evaluation relies on external human validation
The paper proposes a three-step framework (computational metrics + LLM-as-judge + human expert validation) and reports divergence between LLM ratings and human judgments on prosodic validity. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The central 'echo chamber' claim is positioned against independent human expert input rather than reducing to LLM self-assessment by construction. Human validation functions as an external benchmark, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human expert judgments constitute the reliable ground truth for assessing adherence to Tang poetry prosodic rules and overall quality.
Reference graph
Works this paper leans on
- [1] Chateval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations. Yanran Chen, Hannes Gröner, Sina Zarrieß, and Steffen Eger. 2024. Evaluating diversity in automatic poetry generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, ...
- [2] All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics. DeepSeek-AI. 2024. D...
LLM-judge scoring prompt (Chinese original with English version). Scoring dimensions:
- Prosodic adherence (韵律格律): rhyme, tonal pattern, rhyme scheme, rhythm
- Thematic relevance (主题切合度): clarity and appropriateness of theme
- Emotional consistency (情感一致性): sincerity and consistency of emotional expression
- Imagery structure (意象与结构): vividness of imagery and soundness of structure
- Language authenticity (语言经典性): classical flavor and accuracy of language
Poem information:
- Poet: {poet}
- Theme: {theme}
- Emotion: {emotion}
- Imagery: {imagery}
- Form: {form}
- Poem Text: {poem_text}
Output requirement: Please output the scoring results directly in JSON format, giving an integer 1-5 for each dimension:
{ "prosodic_adherence": score, "thematic_relevance": score, "emotional_consistency": score, "imagery_structure": score, "language_authenticity": score }
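The output requirement in the judge prompt implies a parsing step on the consumer side. The sketch below shows one way a judge reply could be validated against the five dimension keys and the 1-5 integer range. The key names come from the prompt. The function name `parse_scores` and the tolerance for surrounding text are assumptions, not something the paper specifies.

```python
import json

# Keys taken from the prompt's JSON output requirement.
DIMENSIONS = (
    "prosodic_adherence",
    "thematic_relevance",
    "emotional_consistency",
    "imagery_structure",
    "language_authenticity",
)


def parse_scores(reply: str) -> dict[str, int]:
    """Extract and validate the five 1-5 integer scores from a judge reply."""
    # Assumption: the model may wrap the JSON in extra text, so take the
    # span from the first "{" to the last "}".
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])

    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name}: expected integer 1-5, got {value!r}")
        scores[name] = value
    return scores
```

Rejecting malformed replies outright, rather than coercing them, keeps invalid judge outputs from silently entering the correlation analysis.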
- Qwen2.5-7B-Instruct: 0.231
- gemma-2-9b-it: 0.226
- Baichuan2-7B-Chat: 0.105
- DeepSeek-V2-Lite-Chat: -0.021
- Mistral-7B-Instruct-v0.3: -0.031
- glm-4-9b-chat-hf: -0.218
Figure 32: Human-LLM correlation by judge model and dimension for DeepSeek gener... (panel title: Human-LLM Correlation by Judge Model and Dimension, DeepSeek Generated Poems, Comparing Performance of 6 Different LLM Judges; legend, LLM Judge: Baichuan2-7B-Chat, DeepSeek-V2-Lite-Chat, Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct, gemma-2-9b-it, glm-4-9b-chat-hf)
- gemma-2-9b-it: 0.305
- DeepSeek-V2-Lite-Chat: 0.185
- Baichuan2-7B-Chat: 0.182
- glm-4-9b-chat-hf: 0.155
- Qwen2.5-7B-Instruct: 0.122
- Mistral-7B-Instruct-v0.3: 0.109
Figure 33: Human-LLM correlation by judge model and dimension for Gemma gene... (panel title: Human-LLM Correlation by Judge Model and Dimension, Gemma Generated Poems, Comparing Performance of 6 Different LLM Judges; legend, LLM Judge: Baichuan2-7B-Chat, DeepSeek-V2-Lite-Chat, Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct, gemma-2-9b-it, glm-4-9b-chat-hf)
- DeepSeek-V2-Lite-Chat: 0.108
- Mistral-7B-Instruct-v0.3: 0.099
- glm-4-9b-chat-hf: 0.080
- Baichuan2-7B-Chat: 0.069
- gemma-2-9b-it: 0.063
- Qwen2.5-7B-Instruct: -0.088
Figure 34: Human-LLM correlation by judge model and dimension for Qwen generated ... (panel title: Human-LLM Correlation by Judge Model and Dimension, Qwen Generated Poems, Comparing Performance of 6 Different LLM Judges; legend, LLM Judge: Baichuan2-7B-Chat, DeepSeek-V2-Lite-Chat, Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct, gemma-2-9b-it, glm-4-9b-chat-hf)