Recognition: no theorem link
POEMetric: The Last Stanza of Humanity
Pith reviewed 2026-05-13 17:10 UTC · model grok-4.3
The pith
Current language models can follow poem forms and themes but fall short of humans in creativity, emotional resonance, and overall quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that while large language models generate poems that closely match specified forms and themes, they do not reach human levels of creativity, idiosyncrasy, emotional resonance, or skillful use of imagery and literary devices, and therefore produce poems of lower overall quality.
What carries the argument
POEMetric, the three-part evaluation framework that scores basic form-and-theme adherence, advanced poetic abilities, and overall quality through rule-based checks plus LLM-as-judge ratings validated by human experts.
If this is right
- LLMs reach high accuracy on replicating fixed poetic forms and stated themes.
- Human poets score higher than any tested model on creativity, idiosyncrasy, emotional resonance, imagery, and literary devices.
- Overall poem quality is higher for human work than for the strongest LLM output.
- Poetry generation continues to pose a substantial challenge for large language models.
Where Pith is reading between the lines
- Further scaling of model size or training data may not automatically close the observed gap in creative expression.
- The same evaluation approach could be applied to other creative writing tasks to test whether the pattern holds beyond poetry.
- Results suggest that human judgment will stay necessary for assessing subjective artistic qualities even as automated judges improve.
Load-bearing premise
That LLM-as-a-judge scores together with human expert validation give a reliable measure of subjective qualities such as creativity and emotional resonance.
What would settle it
A new set of poems generated by current models that human experts rate higher than the human poems in creativity, emotional resonance, imagery, literary devices, and overall quality.
Figures
read the original abstract
Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and codes are released at https://github.com/Bingru-Li/POEMetric.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POEMetric, a comprehensive evaluation framework for poetry generation that assesses LLMs on basic instruction-following (form and theme adherence via rule-based metrics), advanced abilities (creativity, idiosyncrasy, emotional resonance, imagery, and literary devices via LLM-as-a-judge), and overall quality plus authorship estimation. It curates a dataset of 203 human poems across 7 fixed forms, generates 6,090 poems from 30 LLMs using the same prompts, and reports that top LLMs reach high form accuracy (4.26/5) and theme alignment (4.99) but lag humans on advanced qualities (e.g., human creativity 4.02, emotional resonance 4.06, overall quality 4.22 vs. 3.20 for best LLM) and that poetry remains a challenge for LLMs. Results are validated by human experts.
Significance. If the subjective evaluation components prove reliable, the work supplies a useful benchmark and dataset release that quantifies the gap between LLMs and humans on creative dimensions of poetry, underscoring that structural compliance is easier for models than nuanced artistic qualities. The dual use of rule-based and judge-based metrics is a constructive step toward more reproducible assessment in creative NLP.
major comments (2)
- [Evaluation Methodology] Evaluation Methodology section: The abstract and results state that LLM-as-a-judge outputs (e.g., creativity, emotional resonance, overall quality scores) were 'validated by human experts,' yet no details are supplied on expert count, blinding, inter-rater reliability (Cohen’s κ or ICC), or correlation between LLM and human ratings. Because the headline claim that LLMs lag on advanced abilities rests entirely on these subjective scores, the absence of these statistics prevents verification of the reported gaps (e.g., 4.02 vs. lower LLM values).
- [Results] Results section: The identification of Gemini-2.5-Pro as the top model and the assertion that 'all models failed' to match human advanced-ability levels appear post-hoc; the manuscript does not report pre-registered statistical tests, effect sizes, or multiple-comparison corrections across the 30 models. This weakens the strength of the cross-model and human-LLM comparisons.
minor comments (1)
- [Abstract] Abstract: The scoring rubrics for the 1–5 subjective scales are not described even at a high level; a single sentence summarizing the rubric anchors would aid readers.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to improve clarity and rigor in the evaluation methodology and statistical reporting.
read point-by-point responses
-
Referee: [Evaluation Methodology] Evaluation Methodology section: The abstract and results state that LLM-as-a-judge outputs (e.g., creativity, emotional resonance, overall quality scores) were 'validated by human experts,' yet no details are supplied on expert count, blinding, inter-rater reliability (Cohen’s κ or ICC), or correlation between LLM and human ratings. Because the headline claim that LLMs lag on advanced abilities rests entirely on these subjective scores, the absence of these statistics prevents verification of the reported gaps (e.g., 4.02 vs. lower LLM values).
Authors: We agree that the current manuscript omits important details on the human validation process. In the revised version, we will expand the Evaluation Methodology section to specify the number of human experts, the blinding procedures used, inter-rater reliability statistics (e.g., ICC or Cohen’s κ), and the correlation between LLM-as-a-judge scores and human ratings. This addition will directly support verification of the reported performance gaps. revision: yes
-
Referee: [Results] Results section: The identification of Gemini-2.5-Pro as the top model and the assertion that 'all models failed' to match human advanced-ability levels appear post-hoc; the manuscript does not report pre-registered statistical tests, effect sizes, or multiple-comparison corrections across the 30 models. This weakens the strength of the cross-model and human-LLM comparisons.
Authors: We acknowledge that the model ranking and claims were derived from observed results rather than pre-registered analyses. In the revision, we will add effect sizes for human-LLM and cross-model comparisons, apply appropriate multiple-comparison corrections, and explicitly note the exploratory nature of the analysis. Pre-registration was not conducted for this initial study, which we will discuss as a limitation while ensuring the reported differences are supported by the added statistical details. revision: partial
Circularity Check
No circularity in POEMetric derivation chain
full rationale
The paper defines POEMetric as a new composite framework: rule-based metrics for form accuracy and theme alignment plus LLM-as-a-judge (Gemini-2.5-Pro) for creativity, emotional resonance, etc., with separate human-expert validation. Human poems come from an external curated dataset of 203 poems; LLM poems are generated to match the same forms/themes and then scored under the identical protocol. No equations, fitted parameters, or self-citations reduce any reported gap (e.g., human creativity 4.02 vs. LLM) to the input data by construction. The evaluation protocol is externally anchored and does not contain self-definitional or load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human expert validation reliably confirms the accuracy of LLM-as-a-judge scores for subjective poetic attributes.
invented entities (1)
-
POEMetric framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
(Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem follows the given prompt in terms of form, including meter and rhyme where applicable. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[2]
The poem follows the given prompt in terms of its theme. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 17 Published as a conference paper at ICLR 2026 4 - Agree 5 - Strongly agree
work page 2026
-
[3]
(Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem uses a varied vocabulary. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[4]
(Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem is a creative work. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[5]
(Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
This poem shows idiosyncrasy. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[6]
(Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
This poem evokes emotional resonance. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[7]
The imagery in this poem is used well. (Required) __________ 18 Published as a conference paper at ICLR 2026 0 - N/A (No imagery is used) 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
work page 2026
-
[8]
At least one of the literary devices listed below is used well in the poem. (Required) __________ - Simile - Metaphor - Personification - Allusion 0 - N/A (No literary devices are used) 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[9]
___________________________________________________________________
Please comment on why you gave the answer that you did for question 8 above. ___________________________________________________________________
-
[10]
(Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
This is a good poem. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[11]
Please comment on why you gave the answer that you did for question 10 above. ___________________________________________________________________ 19 Published as a conference paper at ICLR 2026
work page 2026
-
[12]
(Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem is written by a human. (Required) __________ 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[13]
Please give comments on why you gave the answer that you did for question 12 above. ___________________________________________________________________ 20 Published as a conference paper at ICLR 2026 POEMetric-based LLM Evaluation prompt # Role Description You are a professional poetry critic and analyst. Your job is to evaluate English poetry written by ...
work page 2026
-
[14]
1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem follows the given prompt in terms of form, including meter and rhyme where applicable. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[15]
The poem follows the given prompt in terms of its theme. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree 21 Published as a conference paper at ICLR 2026
work page 2026
-
[16]
1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem uses a varied vocabulary. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[17]
1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem is a creative work. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[18]
1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
This poem shows idiosyncrasy. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[19]
1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
This poem evokes emotional resonance. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[20]
The imagery in this poem is used well. 0 - N/A (No imagery is used) 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[21]
At least one of the literary devices listed below is used well in the poem. - Simile - Metaphor - Personification - Allusion 0 - N/A (No literary devices are used) 1 - Strongly disagree 22 Published as a conference paper at ICLR 2026 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
work page 2026
-
[22]
Please comment on why you gave the answer that you did for question 8 above
-
[23]
1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
This is a good poem. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[24]
Please comment on why you gave the answer that you did for question 10 above
-
[25]
1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
The poem is written by a human. 1 - Strongly disagree 2 - Disagree 3 - Neutral 4 - Agree 5 - Strongly agree
-
[26]
Please give comments on why you gave the answer that you did for question 12 above. ## Output Format For each multiple-choice question, please give your score directly, without any explanation. Your output should be in the json format as follows: {"1": <insert your score here>, "2": <insert your score here>, ..., "9": "<insert your comments here>", ...} 2...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.