arxiv: 2604.03695 · v1 · submitted 2026-04-04 · 💻 cs.CL

Recognition: no theorem link

POEMetric: The Last Stanza of Humanity

Bingru Li, Han Wang, Hazel Wilkinson

Pith reviewed 2026-05-13 17:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords poetry evaluation · LLM poetry generation · human vs AI comparison · creative writing assessment · literary devices · poem quality · form and theme adherence

The pith

Current language models can follow poem forms and themes but fall short of humans in creativity, emotional resonance, and overall quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POEMetric as a new way to measure poetry generation by LLMs against human work. It separates basic tasks of matching a given form and theme from harder tasks of showing originality, evoking feeling, and deploying imagery and literary devices. Experiments with dozens of models and a set of human poems reveal that top LLMs reach high marks on the basic tasks yet score lower on the advanced ones and on total poem quality. A reader cares because the gap points to persistent limits in how models produce distinctive artistic language even when they can copy surface rules.

Core claim

The paper establishes that while large language models generate poems that closely match specified forms and themes, they do not reach human levels of creativity, idiosyncrasy, emotional resonance, or skillful use of imagery and literary devices, and therefore produce poems of lower overall quality.

What carries the argument

POEMetric, the three-part evaluation framework that scores basic form-and-theme adherence, advanced poetic abilities, and overall quality through rule-based checks plus LLM-as-judge ratings validated by human experts.

If this is right

LLMs reach high accuracy on replicating fixed poetic forms and stated themes.
Human poets score higher than any tested model on creativity, idiosyncrasy, emotional resonance, imagery, and literary devices.
Overall poem quality is higher for human work than for the strongest LLM output.
Poetry generation continues to pose a substantial challenge for large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Further scaling of model size or training data may not automatically close the observed gap in creative expression.
The same evaluation approach could be applied to other creative writing tasks to test whether the pattern holds beyond poetry.
Results suggest that human judgment will stay necessary for assessing subjective artistic qualities even as automated judges improve.

Load-bearing premise

That LLM-as-a-judge scores together with human expert validation give a reliable measure of subjective qualities such as creativity and emotional resonance.

What would settle it

A new set of poems generated by current models that human experts rate higher than the human poems in creativity, emotional resonance, imagery, literary devices, and overall quality.

Figures

Figures reproduced from arXiv: 2604.03695 by Bingru Li, Han Wang, Hazel Wilkinson.

**Figure 2.** An example of the human poem data and the generation prompt for LLMs. On the left are …

**Figure 3.** A showcase of the poems by DeepSeek-R1 (Poem A) and a human poet (Poem B) in …

**Figure 4.** Chain-of-Thought (CoT) process from DeepSeek-R1 for the poem generation. The model …

**Figure 5.** Rule-based evaluation results. LLMs were able to achieve high form accuracy and MATTR.

**Figure 6.** Form accuracy and theme alignment scores. Gemini-2.5-Pro achieved the highest scores in …

**Figure 7.** Advanced creative abilities. Compared with LLMs, human poets excelled in creativity, …

**Figure 8.** Overall poem quality and human authorship estimation scores. Humans ranked first in …

**Figure 9.** The mean scores of POEMetric of human poets and 30 LLMs, evaluated by Gemini-2.5-Pro.

**Figure 10.** The top 20 words across the human and LLM poem datasets.

**Figure 11.** The top opening words and top imagery across the human and LLM poem datasets.

**Figure 12.** A showcase of the poems by Claude-3.7-Sonnet and a human poet in response to the same …

**Figure 13.** A showcase of the poems by Gemini-2.5-Pro and a human poet in response to the same …

**Figure 14.** Basic Instruction-Following Abilities, Average Scores

**Figure 15.** Advanced Creative Abilities, Average Scores

**Figure 16.** Overall Poem Quality, Average Scores (plot of average scores by model; x-axis: Parameters (B))

**Figure 17.** Human Authorship Estimation, Average Scores

read the original abstract

Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and codes are released at https://github.com/Bingru-Li/POEMetric.
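The lexical-diversity measure named in the Figure 5 caption, MATTR (moving-average type-token ratio), can be sketched in a few lines. This is a minimal illustration and not the paper's code: the function name `mattr`, the regex tokenizer, and the window size of 50 are assumptions, since neither the abstract nor the captions specify them.

```python
import re
from collections import Counter


def mattr(text: str, window: int = 50) -> float:
    """Moving-average type-token ratio over a sliding window of tokens.

    Assumptions (not from the paper): lowercase regex tokenization and a
    default window of 50 tokens.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    # Short texts: fall back to the plain type-token ratio.
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)

    counts = Counter(tokens[:window])   # token counts inside the current window
    total = len(counts) / window        # TTR of the first window
    n_windows = len(tokens) - window + 1

    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]         # keep len(counts) equal to the type count
        counts[tokens[i]] += 1
        total += len(counts) / window

    return total / n_windows


if __name__ == "__main__":
    stanza = "the rose is red the violet blue the honey sweet and so are you"
    print(round(mattr(stanza, window=5), 3))
```

Averaging over fixed-size windows makes the score less sensitive to poem length than a plain type-token ratio, which is a common reason to prefer it when comparing a short haiku against a longer sonnet.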

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

POEMetric introduces a practical three-tier benchmark and dataset for poetry generation that shows LLMs matching humans on form but lagging on creativity and emotional resonance, though the subjective scoring lacks key validation details.

The main takeaway is that this paper gives us POEMetric, a new framework that splits poetry evaluation into basic form and theme adherence, advanced creative qualities like emotional resonance and literary devices, and overall quality. The authors built a 203-poem human dataset with annotations and compared 30 LLMs across 6,090 generated poems (203 prompts × 30 models), finding that top models score well on structure but fall short on the harder human-like elements. Humans come out ahead on overall quality too. The data and code are released, which helps others build on the work directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces POEMetric, a comprehensive evaluation framework for poetry generation that assesses LLMs on basic instruction-following (form and theme adherence via rule-based metrics), advanced abilities (creativity, idiosyncrasy, emotional resonance, imagery, and literary devices via LLM-as-a-judge), and overall quality plus authorship estimation. It curates a dataset of 203 human poems across 7 fixed forms, generates 6,090 poems from 30 LLMs using the same prompts, and reports that top LLMs reach high form accuracy (4.26/5) and theme alignment (4.99) but lag humans on advanced qualities (e.g., human creativity 4.02, emotional resonance 4.06, overall quality 4.22 vs. 3.20 for best LLM) and that poetry remains a challenge for LLMs. Results are validated by human experts.

Significance. If the subjective evaluation components prove reliable, the work supplies a useful benchmark and dataset release that quantifies the gap between LLMs and humans on creative dimensions of poetry, underscoring that structural compliance is easier for models than nuanced artistic qualities. The dual use of rule-based and judge-based metrics is a constructive step toward more reproducible assessment in creative NLP.

major comments (2)

[Evaluation Methodology] Evaluation Methodology section: The abstract and results state that LLM-as-a-judge outputs (e.g., creativity, emotional resonance, overall quality scores) were 'validated by human experts,' yet no details are supplied on expert count, blinding, inter-rater reliability (Cohen’s κ or ICC), or correlation between LLM and human ratings. Because the headline claim that LLMs lag on advanced abilities rests entirely on these subjective scores, the absence of these statistics prevents verification of the reported gaps (e.g., 4.02 vs. lower LLM values).
[Results] Results section: The identification of Gemini-2.5-Pro as the top model and the assertion that 'all models failed' to match human advanced-ability levels appear post-hoc; the manuscript does not report pre-registered statistical tests, effect sizes, or multiple-comparison corrections across the 30 models. This weakens the strength of the cross-model and human-LLM comparisons.
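The agreement statistics requested in the first major comment are straightforward to compute once paired ratings exist. The sketch below implements unweighted Cohen's κ and Spearman's rank correlation from scratch; the function names and all ratings in the demo are hypothetical, and nothing here reproduces the paper's validation procedure.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    if p_e == 1.0:
        return 1.0  # both raters used one identical category (a choice made here)
    return (p_o - p_e) / (1 - p_e)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = sqrt(sum((p - mx) ** 2 for p in x))
    sy = sqrt(sum((q - my) ** 2 for q in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical 1-5 ratings of the same eight poems (made-up numbers).
    expert = [4, 5, 3, 2, 4, 3, 5, 1]
    judge = [4, 4, 3, 2, 5, 3, 5, 2]
    print("kappa:", round(cohen_kappa(expert, judge), 3))
    print("spearman:", round(spearman(expert, judge), 3))
```

For ordinal 1-5 scales a weighted κ or an ICC, both of which the referee also names, would give partial credit to near-misses; the unweighted version is shown only because it is the shortest to write down.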

minor comments (1)

[Abstract] Abstract: The scoring rubrics for the 1–5 subjective scales are not described even at a high level; a single sentence summarizing the rubric anchors would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to improve clarity and rigor in the evaluation methodology and statistical reporting.

Referee: [Evaluation Methodology] Evaluation Methodology section: The abstract and results state that LLM-as-a-judge outputs (e.g., creativity, emotional resonance, overall quality scores) were 'validated by human experts,' yet no details are supplied on expert count, blinding, inter-rater reliability (Cohen’s κ or ICC), or correlation between LLM and human ratings. Because the headline claim that LLMs lag on advanced abilities rests entirely on these subjective scores, the absence of these statistics prevents verification of the reported gaps (e.g., 4.02 vs. lower LLM values).

Authors: We agree that the current manuscript omits important details on the human validation process. In the revised version, we will expand the Evaluation Methodology section to specify the number of human experts, the blinding procedures used, inter-rater reliability statistics (e.g., ICC or Cohen’s κ), and the correlation between LLM-as-a-judge scores and human ratings. This addition will directly support verification of the reported performance gaps. revision: yes
Referee: [Results] Results section: The identification of Gemini-2.5-Pro as the top model and the assertion that 'all models failed' to match human advanced-ability levels appear post-hoc; the manuscript does not report pre-registered statistical tests, effect sizes, or multiple-comparison corrections across the 30 models. This weakens the strength of the cross-model and human-LLM comparisons.

Authors: We acknowledge that the model ranking and claims were derived from observed results rather than pre-registered analyses. In the revision, we will add effect sizes for human-LLM and cross-model comparisons, apply appropriate multiple-comparison corrections, and explicitly note the exploratory nature of the analysis. Pre-registration was not conducted for this initial study, which we will discuss as a limitation while ensuring the reported differences are supported by the added statistical details. revision: partial
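The statistical additions promised in the second response could look like the following sketch. It computes Cohen's d with a pooled standard deviation and applies a Holm step-down correction to a list of p-values. The choice of Cohen's d and Holm is an assumption made here for illustration (the rebuttal only says "effect sizes" and "appropriate multiple-comparison corrections"), and the scores and p-values in the demo are invented.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled sample SD."""
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each sample needs at least 2 observations")
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples are constant")
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)   # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Made-up 1-5 creativity scores for a human sample and one model's sample.
    human = [4, 5, 4, 4, 3, 5]
    model = [3, 3, 4, 2, 3, 3]
    print("d:", round(cohens_d(human, model), 2))

    # Made-up raw p-values, one per human-vs-model comparison.
    raw = [0.001, 0.020, 0.030, 0.400]
    print("holm:", [round(p, 3) for p in holm_adjust(raw)])
```

With 30 models compared against one human baseline, Holm controls the family-wise error rate while being less conservative than a plain Bonferroni correction.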

Circularity Check

0 steps flagged

No circularity in POEMetric derivation chain

The paper defines POEMetric as a new composite framework: rule-based metrics for form accuracy and theme alignment plus LLM-as-a-judge (Gemini-2.5-Pro) for creativity, emotional resonance, etc., with separate human-expert validation. Human poems come from an external curated dataset of 203 poems; LLM poems are generated to match the same forms/themes and then scored under the identical protocol. No equations, fitted parameters, or self-citations reduce any reported gap (e.g., human creativity 4.02 vs. LLM) to the input data by construction. The evaluation protocol is externally anchored and does not contain self-definitional or load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the assumption that the chosen evaluation dimensions and human validation accurately capture poetic quality; no free parameters are described, and the framework itself is the primary invented construct.

axioms (1)

domain assumption · Human expert validation reliably confirms the accuracy of LLM-as-a-judge scores for subjective poetic attributes.
Paper states results were validated by human experts without reporting agreement metrics or selection criteria.

invented entities (1)

POEMetric framework · no independent evidence
purpose: Comprehensive multi-aspect poetry evaluation system
Newly proposed evaluation structure covering instruction following, advanced abilities, and overall appraisal.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1]

The poem follows the given prompt in terms of form, including meter and rhyme where applicable. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[2]

The poem follows the given prompt in terms of its theme. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

(Source page header: Published as a conference paper at ICLR 2026)

[3]

The poem uses a varied vocabulary. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[4]

The poem is a creative work. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[5]

This poem shows idiosyncrasy. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[6]

This poem evokes emotional resonance. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[7]

The imagery in this poem is used well. (Required) 0 - N/A (No imagery is used); 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[8]

At least one of the literary devices listed below is used well in the poem. (Required) Devices: Simile, Metaphor, Personification, Allusion. 0 - N/A (No literary devices are used); 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[9]

Please comment on why you gave the answer that you did for question 8 above.

[10]

This is a good poem. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[11]

Please comment on why you gave the answer that you did for question 10 above.

[12]

The poem is written by a human. (Required) 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[13]

Please give comments on why you gave the answer that you did for question 12 above.

POEMetric-based LLM Evaluation prompt: # Role Description You are a professional poetry critic and analyst. Your job is to evaluate English poetry written by ...

[14]

The poem follows the given prompt in terms of form, including meter and rhyme where applicable. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[15]

The poem follows the given prompt in terms of its theme. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[16]

The poem uses a varied vocabulary. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[17]

The poem is a creative work. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[18]

This poem shows idiosyncrasy. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[19]

This poem evokes emotional resonance. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[20]

The imagery in this poem is used well. 0 - N/A (No imagery is used); 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[21]

At least one of the literary devices listed below is used well in the poem. Devices: Simile, Metaphor, Personification, Allusion. 0 - N/A (No literary devices are used); 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[22]

Please comment on why you gave the answer that you did for question 8 above

[23]

This is a good poem. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[24]

Please comment on why you gave the answer that you did for question 10 above

[25]

The poem is written by a human. 1 - Strongly disagree; 2 - Disagree; 3 - Neutral; 4 - Agree; 5 - Strongly agree

[26]

Please give comments on why you gave the answer that you did for question 12 above. ## Output Format For each multiple-choice question, please give your score directly, without any explanation. Your output should be in the json format as follows: {"1": <insert your score here>, "2": <insert your score here>, ..., "9": "<insert your comments here>", ...} 2...
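The judge prompt excerpted in the entries above asks for a JSON object keyed by question number, with integer scores for the rated questions and free-text comments for questions 9, 11, and 13. A minimal sketch of parsing and validating such a reply might look like this. The dimension labels, the function name `parse_judge_output`, and the decision to map a 0 (N/A) score to `None` are assumptions made for illustration; they are not taken from the released code.

```python
import json

# Question number -> short label. The question wording comes from the prompt
# excerpt; the label names themselves are hypothetical.
DIMENSIONS = {
    "1": "form",
    "2": "theme",
    "3": "lexical_diversity",
    "4": "creativity",
    "5": "idiosyncrasy",
    "6": "emotional_resonance",
    "7": "imagery",
    "8": "literary_devices",
    "10": "overall_quality",
    "12": "human_authorship",
}
ALLOW_ZERO = {"7", "8"}          # these two questions offer "0 - N/A"
COMMENT_KEYS = ("9", "11", "13")  # free-text justifications


def parse_judge_output(raw: str) -> dict:
    """Parse one judge reply into validated scores and comments."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    scores = {}
    for key, name in DIMENSIONS.items():
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"question {key}: expected an integer score")
        low = 0 if key in ALLOW_ZERO else 1
        if not low <= value <= 5:
            raise ValueError(f"question {key}: score {value} out of range")
        scores[name] = None if value == 0 else value  # 0 means N/A

    comments = {k: data[k] for k in COMMENT_KEYS if k in data}
    return {"scores": scores, "comments": comments}


if __name__ == "__main__":
    # A made-up reply in the requested format.
    reply = (
        '{"1": 4, "2": 5, "3": 4, "4": 3, "5": 3, "6": 4, "7": 0, "8": 4, '
        '"9": "One apt metaphor.", "10": 3, "11": "Competent but flat.", '
        '"12": 2, "13": "Reads as formulaic."}'
    )
    print(parse_judge_output(reply)["scores"])
```

Treating N/A as missing rather than as a numeric 0 keeps it from dragging down per-dimension averages, though whether the paper averages scores this way is not stated in the material shown here.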