SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions

Nianjun Zhou; Tianyu Wang

arxiv: 2605.07102 · v1 · submitted 2026-05-08 · 💻 cs.CL

Pith reviewed 2026-05-11 00:58 UTC · model grok-4.3

keywords: literary evaluation · LLM assessment · ontology-grounded dimensions · genre hierarchy · automated text evaluation · short story analysis · interpretive dimensions · hierarchical framework

The pith

A hierarchical LLM framework evaluates literary quality along ontology-grounded dimensions with high reliability and reveals a consistent genre ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGE as a way to assess interpretive aspects of literature that resist simple computation by breaking quality into cultural representation, emotional depth, and philosophical sophistication. It applies structured LLM prompting with iterative reflection and validation across three analytical layers on a set of canonical, pulp, and AI-generated short stories. The approach yields strong consistency in scores and distinguishes the genres in a stable order, with bigger gaps in cultural and philosophical layers than in emotional ones. This matters because it supplies a repeatable method for judging open-ended creative text where existing metrics fall short and can flag specific limitations in current generative models.

Core claim

The SAGE framework decomposes literary quality into ontology-grounded interpretive dimensions and assesses them through structured large language model evaluation with multi-round iterative reflection and independent validation. When applied to 100 short stories across cultural, emotional-psychological, and existential-philosophical layers in dual modes, it produces consistent results that place canonical works above pulp fiction above LLM-generated narratives, with layer-specific patterns showing larger separations in critical and philosophical facets than in affective ones.

What carries the argument

Hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation.
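The multi-round reflection and independent validation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the stub evaluator, and both tolerance values are hypothetical, and the convergence rule (last two rounds within a tolerance) is an assumption, since the text summarized here does not define convergence. Only the five-round count comes from the Figure 2 caption.

```python
from typing import Callable, List

# Hypothetical signature: (story_text, round_number, earlier_scores) -> score.
RoundEvaluator = Callable[[str, int, List[float]], float]


def run_rounds(story: str, evaluate_round: RoundEvaluator, n_rounds: int = 5) -> List[float]:
    """Collect one score per reflection round; each round sees the earlier scores."""
    history: List[float] = []
    for round_number in range(1, n_rounds + 1):
        score = evaluate_round(story, round_number, list(history))
        history.append(score)
    return history


def has_converged(history: List[float], tol: float = 0.5) -> bool:
    """Assumed rule: the last two rounds differ by at most `tol`."""
    if len(history) < 2:
        return False
    return abs(history[-1] - history[-2]) <= tol


def validator_agrees(final_score: float, validator_score: float, tol: float = 1.0) -> bool:
    """Assumed rule: the independent validator's score is within `tol` of the final score."""
    return abs(final_score - validator_score) <= tol


def stub_evaluator(story: str, round_number: int, earlier: List[float]) -> float:
    """Deterministic stand-in for an LLM call; it halves its distance to 7.0 each round."""
    return 7.0 - 2.0 / (2 ** (round_number - 1))


if __name__ == "__main__":
    history = run_rounds("Once upon a time ...", stub_evaluator)
    print("scores per round:", history)
    print("converged:", has_converged(history))
    print("validator agrees:", validator_agrees(history[-1], validator_score=7.0))
```

In a real pipeline the stub would be replaced by a model call that receives the previous rounds' reasoning, and the validator score would come from a separate single-pass prompt.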

If this is right

Enables scalable automated evaluation of open-ended text generation.
Identifies that affective patterns are more learnable from training data than critical stance or philosophical depth.
Supports systematic comparison of human and machine literary production.
Confirms the three dimensions measure empirically distinguishable facets of quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The layer-specific gaps suggest that future AI training could target explicit modeling of philosophical and cultural elements to close performance differences.
Similar hierarchical structures might apply to evaluating other creative forms such as poetry or longer narratives.
The method could function as a benchmark for measuring progress in AI narrative sophistication over successive model versions.
Extending evaluation to works from additional cultural traditions would test whether the observed hierarchy holds more broadly.

Load-bearing premise

The chosen dimensions of cultural representation, emotional depth, and philosophical sophistication validly and exhaustively capture literary quality and the resulting LLM judgments generalize beyond the tested stories and models.

What would settle it

A new test set of stories or different models that produces inter-rater agreement below 90 percent or reverses the observed genre ranking order would show the framework does not deliver the claimed reliability.
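The two failure conditions named here can be checked mechanically once scores exist. The sketch below is one way to do that under stated assumptions: agreement is defined as two raters' scores falling within a tolerance (the paper's exact definition is not given in this summary), and all scores and group labels are made-up illustrations.

```python
from statistics import mean
from typing import Dict, List, Sequence


def agreement_rate(scores_a: Sequence[float], scores_b: Sequence[float], tol: float = 1.0) -> float:
    """Fraction of items on which two raters' scores differ by at most `tol` (assumed definition)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both raters must score the same non-empty set of items")
    hits = sum(abs(a - b) <= tol for a, b in zip(scores_a, scores_b))
    return hits / len(scores_a)


def ranking_holds(group_scores: Dict[str, List[float]], expected_order: Sequence[str]) -> bool:
    """True if group means strictly decrease along `expected_order`."""
    means = [mean(group_scores[name]) for name in expected_order]
    return all(higher > lower for higher, lower in zip(means, means[1:]))


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    rater_1 = [7.0, 6.0, 5.0, 8.0]
    rater_2 = [7.5, 6.0, 3.0, 8.0]
    groups = {"canonical": [8.0, 9.0], "pulp": [6.0, 7.0], "llm": [4.0, 5.0]}

    rate = agreement_rate(rater_1, rater_2)
    print("agreement below 90 percent:", rate < 0.90)
    print("genre order preserved:", ranking_holds(groups, ["canonical", "pulp", "llm"]))
```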

Figures

Figures reproduced from arXiv: 2605.07102 by Nianjun Zhou, Tianyu Wang.

**Figure 1.** SAGE framework architecture. (a) Six-layer hierarchy: Layers 1–3 assess textual properties through rule-based metrics ...

**Figure 2.** Score convergence trajectories across five iterative rounds for each layer (L4–L6) and genre category. Scores stabilize ...

**Figure 4.** Effect size (Cohen’s ...

Original abstract

Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen's d>2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
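For readers who want to sanity-check effect sizes of the kind quoted in the abstract, Cohen's d is the difference in group means divided by a standard deviation. The sketch below uses the pooled-standard-deviation form, which is an assumption: the abstract does not say which variant the authors computed. The sample scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical layer scores for two genres, not data from the paper.
    canonical = [8.0, 9.0, 10.0]
    llm_generated = [5.0, 6.0, 7.0]
    print("Cohen's d:", cohens_d(canonical, llm_generated))
```

With unequal group sizes such as 50, 30, and 20 stories, the pooled form weights each group's variance by its degrees of freedom, which is one reason the choice of variant is worth reporting.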

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SAGE delivers high internal consistency for LLM literary scoring but stays trapped in self-validation without human anchors.

The main thing to know is that this paper gets LLMs to agree with each other at very high rates when scoring stories on cultural, emotional, and philosophical dimensions, yet it never checks whether those scores line up with what actual readers or critics would say.

The framework itself is the new piece: it layers three ontology-grounded dimensions, runs multi-round reflection inside the model, and tests dual content-versus-metadata modes on a 100-story set split across canonical, pulp, and LLM-generated texts. That structure is more specific than generic LLM-as-judge prompts, and the reported numbers (98.8% convergence, >94% inter-rater agreement, near-perfect mode invariance) are clean enough to show the procedure is reproducible within the same model family. The genre hierarchy and layer-specific effect sizes (larger gaps in cultural and philosophical layers) also give a concrete picture of where current generators lag. Those results are worth having on the record.

The soft spot is the complete absence of external calibration. Nothing in the work compares the LLM outputs to human literary experts using the same rubric on the same stories, so the claimed “measurement-grade reliability” rests only on internal agreement. That leaves the hierarchy open to the obvious alternative explanation that the models are simply reproducing patterns from their training data, especially when judging their own outputs. Prompt details and bias controls are also thin in the abstract, which makes the effect sizes harder to trust at face value.

This paper is for groups already building automated creative-text evaluators who need a worked example of hierarchical prompting and reflection. A reader who cares about LLM judge protocols will find usable structure here even if the validity claims need more work. It deserves a serious referee because the framework is concrete and the statistical claims are testable; reviewers will almost certainly require human calibration data before acceptance.

Referee Report

3 major / 2 minor

Summary. The paper introduces SAGE, a hierarchical LLM-based framework for literary evaluation that decomposes quality into ontology-grounded interpretive dimensions (cultural representation, emotional depth, philosophical sophistication) assessed via structured prompting with multi-round reflection and dual-mode (content/metadata) validation. It tests the framework on 100 short stories (50 canonical, 30 pulp, 20 LLM-generated) across three analytical layers, reporting 98.8% score convergence, >94% inter-rater agreement, near-perfect mode invariance, and a statistically significant genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific effect sizes and cross-layer correlations (r=0.649-0.683).

Significance. If externally validated, the framework could offer a scalable, theory-driven method for automated assessment of open-ended text generation, with the layer-specific discrimination (large effects for cultural/philosophical vs. smaller for emotional) providing concrete guidance on where LLMs fall short of human literary production. The internal consistency metrics and reproducible procedure are strengths that could support systematic evaluation pipelines.

major comments (3)

[Abstract and Validation] The 98.8% convergence and >94% inter-rater agreement are produced entirely by LLM evaluators, on both LLM-generated and human-written stories, without any reported calibration against human literary experts (e.g., critics or professors) using the identical rubric on the same 100 stories. This leaves the central claim of 'measurement-grade reliability' for the ontology-grounded dimensions resting solely on internal LLM consistency, which does not rule out shared training-data biases.
[Statistical Analysis and Results] The reported genre hierarchy (Canonical > Pulp > LLM, p<0.001) with very large effect sizes (Cohen's d>2.4 for cultural critique and philosophical depth) is presented as evidence of genuine interpretive discrimination, but without human ground-truth anchoring the hierarchy could equally reflect evaluator-model biases rather than valid quality differences.
[Framework Description] The three interpretive dimensions are asserted to be ontology-grounded and to capture distinguishable facets (supported by r=0.649-0.683 correlations), but the paper provides no derivation from literary theory, no justification for exhaustiveness, and no test of whether they validly measure literary quality beyond the specific test set.
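The third comment turns on what the cross-layer correlations can show. As a rough aid, the sketch below computes a Pearson correlation between two layers' scores and the corresponding shared variance (r squared). The helper names and sample scores are hypothetical; the only figures taken from the paper are the reported bounds r=0.649 and r=0.683, which correspond to roughly 42% and 47% shared variance.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (spread_x * spread_y)


def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation."""
    return r * r


if __name__ == "__main__":
    # Hypothetical per-story scores on two layers, for illustration only.
    cultural = [8.0, 6.5, 4.0, 7.0, 5.0]
    philosophical = [7.5, 6.0, 5.0, 8.0, 4.5]
    print("r between layers:", pearson_r(cultural, philosophical))
    # Shared variance implied by the correlations reported in the abstract.
    print("shared variance at r=0.649:", shared_variance(0.649))
    print("shared variance at r=0.683:", shared_variance(0.683))
```

Moderate correlation is compatible with distinct constructs, but it does not by itself establish that each dimension measures what its label says, which is the referee's point.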

minor comments (2)

[Abstract] The dual-mode assessment is described as showing 'near-perfect mode invariance,' but the abstract does not define metadata-based evaluation or explain how it avoids leaking genre information.
[Methods] The exact prompt templates, the reflection process, the dimension operationalizations, and the selection criteria for the 100 stories are not summarized, which limits assessment of reproducibility even if these details appear in the full text.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for these incisive comments, which correctly identify key limitations in the current validation approach. We respond point by point below and will revise the manuscript accordingly where feasible.

Referee: [Abstract and Validation] The 98.8% convergence and >94% inter-rater agreement are produced entirely by LLM evaluators without any reported calibration against human literary experts using the identical rubric on the same 100 stories. This leaves the central claim of 'measurement-grade reliability' resting solely on internal LLM consistency, which does not rule out shared training-data biases.

Authors: We acknowledge this limitation: the reported metrics demonstrate internal reproducibility of the LLM evaluators but do not constitute external validation against human experts. The framework is explicitly positioned as an LLM-based method, and the observed alignment with expected genre distinctions provides indirect corroboration, yet we agree this does not fully exclude model-specific biases. We will add an explicit Limitations subsection in the Validation section that states the absence of human calibration and outlines a planned follow-up study with literary scholars applying the same rubric.
revision: partial

Referee: [Statistical Analysis and Results] The reported genre hierarchy (Canonical > Pulp > LLM, p<0.001) with very large effect sizes is presented as evidence of genuine interpretive discrimination, but without human ground-truth anchoring the hierarchy could equally reflect evaluator-model biases rather than valid quality differences.

Authors: The referee is correct that the hierarchy remains correlational without human ground truth. We selected the story sets according to pre-existing literary classifications rather than post-hoc labeling, and the differential effect sizes across layers (larger for cultural/philosophical than emotional) match documented LLM shortcomings. Nevertheless, we will revise the Results and Discussion to frame the hierarchy more cautiously as preliminary evidence of discrimination and to stress the requirement for human-anchored validation in future work.
revision: partial

Referee: [Framework Description] The three interpretive dimensions are asserted to be ontology-grounded and to capture distinguishable facets (supported by r=0.649-0.683 correlations), but no derivation from literary theory, justification for exhaustiveness, or test of whether they validly measure literary quality beyond the specific test set is provided.

Authors: The dimensions are drawn from core areas of literary theory (cultural studies for representation, affective criticism for emotional depth, and hermeneutic/existential approaches for philosophical sophistication). The moderate inter-layer correlations support their distinguishability. We will expand the Framework Description with explicit citations to relevant theorists and a clearer statement that the dimensions are not claimed to be exhaustive. We will also add a paragraph on generalizability limitations beyond the current test set.
revision: yes

standing simulated objections not resolved

Direct calibration of SAGE scores against human literary experts (critics or professors) on the identical 100 stories using the same rubric

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper introduces the SAGE framework as a hierarchical LLM-based evaluation using ontology-grounded dimensions and structured prompting with reflection. It then reports empirical outcomes from applying this procedure to an external test set of 100 stories (50 canonical, 30 pulp, 20 LLM-generated), yielding 98.8% score convergence, >94% inter-rater agreement, and mode invariance across 600 evaluations. These metrics are presented as measured results of the evaluation process rather than being presupposed by definition or fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatzes smuggled via citation are invoked in the provided text to justify core premises. The genre hierarchy and effect sizes are statistical findings from the data, not tautological reductions. The derivation remains self-contained against the described procedure and test stories.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested premise that the three interpretive dimensions are both necessary and sufficient for literary quality and that LLM outputs under structured prompting can be treated as stable measurements; no free parameters are explicitly fitted in the abstract, but the framework implicitly introduces many design choices in dimension definition and prompting.

axioms (2)

domain assumption: Ontology-grounded interpretive dimensions (cultural, emotional-psychological, existential-philosophical) provide a valid decomposition of literary quality.
Invoked in the framework definition and used to structure all evaluations; no justification or external validation supplied in abstract.

domain assumption: Multi-round iterative reflection plus independent validation produces reliable LLM judgments of open-ended text.
Central to the 98.8% convergence claim; treated as given rather than derived.

pith-pipeline@v0.9.0 · 5556 in / 1541 out tokens · 37877 ms · 2026-05-11T00:58:08.570950+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] Modeling local coherence: An entity-based approach. Computational Linguistics 34(1):1–34. [Berlant 2020] Berlant, L. 2020. Cruel optimism. Duke University Press. [Bizzoni et al. 2023] Bizzoni, Y.; Moreira, P.; Thomsen, M. R.; and Nielbo, K. L. 2023. The fractality of sentiment arcs for literary quality assessment: The case of Nobel laureates. Journal of...

[2] Evidence Sovereignty: All scores must cite specific textual evidence

[3] Strict Layer Boundary: Evaluate ONLY emotional-psychological content. Do NOT evaluate cultural power structures (Layer 4) or philosophical themes (Layer 6)

[4] Projection Awareness: Avoid imposing assumptions about "how emotions should be expressed." Restrained expression ≠ shallow

[5] Dimension Independence: Evaluate each dimension independently. Mode: content-limit Evaluate based SOLELY on textual evidence. Do NOT use prior knowledge about literary works or authors. Dimensions:

[6] AC (Affective Complexity) -- Multiplicity, contradiction, and evolution of emotional states. [Sedgwick, Berlant, Ngai]

[7] PI (Psychological Interiority) -- Depth and accessibility of characters’ inner psychological worlds. [Cohn, Bruner, Bakhtin]

[8] EG (Emotional Granularity) -- Precision and differentiation of emotional vocabulary. [Barrett]

[9] ENC (Emotional-Narrative Coherence) -- Whether emotions arise organically from narrative situation. [New Criticism, James Wood] Round 1 User Prompt (condensed): ROUND 1/5: EMOTIONAL CONTENT EXTRACTION Step 1: Extract emotional content (explicit emotion words, psychological states, interior techniques, affective moments). Step 2: Score each dimension (1.0-...

[10] Hallucination Check: Did you make any claims not supported by the text? Did you fabricate cultural or historical details? Did you over-interpret ambiguous evidence?

[11] Confidence Calibration: Were confidence scores accurate given evidence strength?

[12] Reasoning Quality: Did you cite specific textual evidence for all claims?

[13] Layer Boundary Check: Did you accidentally evaluate emotions (L5) or philosophical themes (L6)? Layer 6 (Existential): Round 2 addresses Western framework imposition, a risk unique to philosophical evaluation. The prompt explicitly names non-Western traditions the evaluator must not overlook: ROUND 2/5: PROJECTION BIAS CHECK

[14] Western Existentialism Imposition: Did I project Sartre/Camus/Heidegger frameworks onto a text from a different philosophical tradition? Did I overlook Eastern depth (Buddhist impermanence, Confucian relational ethics, Daoist acceptance)?

[15] Explicit Philosophy Bias: Did I equate explicit philosophical discourse with depth, penalizing themes expressed through action or structure?

[16] Profundity Bias: Did I assume dark or tragic themes are more existentially profound than themes of joy, connection, or ordinary life?

[17] YOUR RESPONSIBILITIES:
Implicit Existentialism Omission: Did I miss existential depth expressed in non-Western vocabulary (Buddhist anicca/dukkha/anatta, Confucian li/ren/he, Daoist wu-wei)? Independent Validator Prompt (condensed). The validator receives the complete five-round iterative conversation and applies the following system prompt for single-pass cross-verification: ...

[18] Projection Bias Detection: Identify if the iterative evaluator projected assumptions about emotional or philosophical expression; flag cultural bias in applied norms

[19] Hallucination Detection: Flag claims not grounded in the text; identify over-interpretations; check for layer boundary violations

[20] Reasoning Quality Assessment: Evaluate coherence of arguments; check whether evidence supports interpretations

[21] Confidence Calibration: Flag overconfidence (high confidence with ambiguous content) and underconfidence (low confidence with clear markers)

[22] Independent Scoring: Provide own dimension scores based on the text; compare with iterative evaluator’s scores; explain agreements and disagreements with specific textual evidence. CRITICAL MINDSET: Be skeptical but fair. Do not automatically agree. Focus on evidence quality, not score numbers. Avoid your own projection bias in the process of detecting th...
