arxiv: 2604.19071 · v1 · submitted 2026-04-21 · 💻 cs.CL

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Lin Fan, Yilin Zhou, Zikang Wang, Xiaotao Gu, Jie Tang, Hongning Wang, Minlie Huang

Pith reviewed 2026-05-10 03:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM writing evaluation · Tree-of-Writing · HowToBench · human judgment correlation · benchmark · text generation metrics

The pith

A tree-structured workflow for LLM writing evaluation reaches 0.93 correlation with human judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tree-of-Writing (ToW) to address the inconsistencies that arise when a large language model judges writing quality by weighing all sub-features at once. It builds HowToBench, a Chinese benchmark with 12 genres and 1302 instructions spanning contextual completion, outline-guided writing, and open-ended generation. Traditional overlap metrics and flat LLM-as-a-judge approaches prove sensitive to small text changes, while the tree method stays stable and matches human ratings more closely. If the approach holds, evaluators could assess long-form creative output more reliably than current reference-based or single-prompt methods allow.

Core claim

ToW resolves implicit inconsistency in LLM-as-a-judge by organizing sub-feature aggregation into an explicit tree structure with modeled weights, yielding 0.93 Pearson correlation to human judgments on HowToBench while remaining robust to textual disturbances that degrade both overlap-based metrics and standard LLM judges.

What carries the argument

Tree-of-Writing, a workflow that structures writing evaluation as a tree of sub-features and explicitly sets their aggregation weights to reduce bias in LLM judgments.
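
To make the idea of explicit aggregation weights concrete, here is a minimal Python sketch of a weighted score tree. It is not the paper's implementation: the node names, edge weights, leaf scores, and the weight-normalised average are all illustrative assumptions, since the page does not state how ToW sets or combines its weights.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One sub-feature in an evaluation tree (all names here are hypothetical)."""
    name: str
    weight: float = 1.0             # weight of the edge from the parent to this node
    score: Optional[float] = None   # leaf score, e.g. from a judge; None for inner nodes
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return a leaf's own score, or the weight-normalised average of the
    aggregated scores of an inner node's children."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} need a positive total weight")
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


# Hypothetical two-level tree: "content" counts three times as much as "format".
tree = Node("overall", children=[
    Node("content", weight=3.0, children=[
        Node("theme", weight=1.0, score=8.0),
        Node("logic", weight=1.0, score=6.0),
    ]),
    Node("format", weight=1.0, score=9.0),
])

print(aggregate(tree))  # 7.5
```

Because every weight sits on an edge of the tree, the contribution of each sub-feature to the final score can be read off directly, which is the property a flat single-prompt judge does not expose.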

If this is right

Overlap-based metrics fail under textual disturbances on long-form writing.
Flat LLM-as-a-judge methods exhibit biases that tree aggregation avoids.
Content scores in outline-guided tasks decline with longer inputs despite added information.
Reliable automated writing assessment becomes feasible for open-ended, thousand-word outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tree structure could be tested as a training signal to improve the writing quality of the models being evaluated.
Extending the tree to include cross-genre consistency checks might reveal whether LLMs truly generalize writing skills.
If weights prove stable across benchmarks, the method offers a path to automated feedback that humans can trust for iterative revision.

Load-bearing premise

The chosen tree weights genuinely remove inconsistency rather than fitting the specific human ratings collected for this benchmark, and the 12 genres plus three task types represent writing capability broadly.

What would settle it

Re-running the full evaluation on a fresh set of writing samples from genres or languages absent from HowToBench and checking whether the 0.93 correlation and disturbance robustness persist.
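
As a rough illustration of that check, the sketch below computes Pearson's r between evaluator scores and human ratings on a fresh sample set and compares it with the reported 0.93. The `tolerance` value and the function names are assumptions made for the example, not criteria taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def correlation_persists(evaluator_scores, human_scores, reported=0.93, tolerance=0.05):
    """True if r on the fresh samples is no more than `tolerance` below `reported`.

    `tolerance` is a hypothetical margin chosen for illustration only.
    """
    return pearson(evaluator_scores, human_scores) >= reported - tolerance
```

A call such as `correlation_persists(tow_scores, human_scores)` on samples from an unseen genre or language would then give a first, coarse answer; a fuller check would also report a confidence interval for r.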

Figures

Figures reproduced from arXiv: 2604.19071 by Andrew Zhuoer Feng, Cunxiang Wang, Hongning Wang, Jie Tang, Lin Fan, Minlie Huang, Xiaotao Gu, Yilin Zhou, Yu Luo, Zikang Wang.

**Figure 2.** Hierarchical taxonomy of HOWTOBENCH showing the major categories.

**Figure 3.** Edge weight distribution on fiction, argumen [caption truncated in source]

**Figure 4.** Pearson correlation cross evaluators and hu [caption truncated in source]

**Figure 5.** Factor Analysis between input length, output length, overall score, content score. The bold black line [caption truncated in source]

**Figure 6.** Edge weight distribution on different genres. The wider is the box horizontally, the more varied is the [caption truncated in source]

Original abstract

Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM's performance in thousand-words level and open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW), to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow by explicitly modeling the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we detect that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showcasing that it cannot be simply improved by input-side information piling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HowToBench is a solid new dataset for long-form LLM writing eval, but ToW's 0.93 correlation looks like it could be circular without clear evidence on how the tree weights were set.

The main thing to know is that this paper builds HowToBench, a Chinese dataset of 1302 writing instructions across 12 genres and three task types, and pairs it with Tree-of-Writing, a tree-structured workflow that makes sub-feature aggregation weights explicit instead of leaving them implicit in a flat LLM judge prompt. It reports 0.93 Pearson correlation with human scores and claims robustness where overlap metrics and standard LLM judges fail under textual disturbances. The negative correlation between input length and content scores in the guide task is also noted as a side finding.

The dataset and disturbance experiments are the clearest additions here. They give concrete evidence that current automatic metrics struggle with open-ended, thousand-word writing and that simple perturbations expose their weaknesses. That part is useful for anyone working on evaluation frameworks.

The soft spot is the central ToW claim. The abstract says explicit tree modeling resolves inconsistency, but it gives no information on whether the aggregation weights were fixed ahead of time, taken from prior work, or tuned to maximize correlation on the HowToBench human judgments themselves. If the latter, the 0.93 number is just a fitted result on this data rather than proof the structure inherently fixes bias or generalizes. The stress-test concern on this point stands, since no cross-validation, hold-out set, or pre-specified weighting protocol is described. Without those details the robustness advantage remains hard to evaluate.

This paper is mainly for people building or using LLM writing benchmarks, especially in non-English settings or for long-form tasks. The dataset itself could be worth looking at even if the ToW method needs more scrutiny. It should go to peer review because the benchmark scale and the diagnostic experiments are substantial enough to justify referee time, though the methods section will need to address the weight selection question directly.

Referee Report

3 major / 3 minor

Summary. The paper proposes Tree-of-Writing (ToW), a tree-structured workflow for LLM writing evaluation that explicitly models sub-feature aggregation weights to mitigate inconsistencies in standard LLM-as-a-judge approaches. It introduces HowToBench, a large-scale Chinese benchmark with 1302 instructions across 12 genres and three task categories (contextual completion, outline-guided writing, open-ended generation). The central claims are that ToW achieves 0.93 Pearson correlation with human judgments, is robust to textual disturbances (unlike overlap-based metrics and popular LLM judges), and reveals a negative correlation between input length and content scores in the guide task.

Significance. If the aggregation weights are shown to be set independently of the HowToBench human data and the benchmark proves representative, ToW could offer a more reliable framework for holistic, long-form writing evaluation than current reference-based or single-prompt LLM judges. The scale of the benchmark and the explicit robustness tests to disturbances are strengths that could influence future evaluation protocols. However, the current lack of detail on weight determination substantially reduces the significance of the correlation and robustness claims.

major comments (3)

Section 3 (ToW method): The paper states that ToW resolves inconsistency 'by explicitly modeling the aggregation weights of sub-features,' but provides no information on how these weights are determined: whether fixed a priori, derived from external data, or optimized (even via tree structure) against the human judgments on the 1302 HowToBench instructions. This is load-bearing for the 0.93 correlation claim, as post-hoc fitting would make the result circular rather than evidence of principled bias mitigation.
Section 5 (Experiments): The textual disturbance robustness tests lack a detailed protocol, including the exact disturbances applied, number of variants per sample, and quantitative criteria for 'vulnerable' vs. 'robust.' Without this, the comparative claim that overlap-based metrics and LLM-as-a-judge are vulnerable while ToW is not cannot be evaluated.
Section 4 (HowToBench): The assertion that the benchmark supports broad claims about 'LLM's Capability in Human-level Writing' rests on 12 genres and three task categories, but the paper offers no external validation or comparison showing these are representative of human-level writing distributions.

minor comments (3)

Abstract and title: 'HoWToBench' vs. 'HowToBench' spelling inconsistency; standardize throughout.
Abstract: 'LLM's performance' and 'LLM's Capability' contain possessive errors; should be 'LLMs' performance' or rephrase for clarity.
Results: The negative correlation finding between input length and content scores is interesting but would benefit from error bars or statistical significance tests in the relevant results table.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with clarifications and commitments to revisions that strengthen the manuscript without overstating its contributions.

Referee: Section 3 (ToW method): The paper states that ToW resolves inconsistency 'by explicitly modeling the aggregation weights of sub-features,' but provides no information on how these weights are determined: whether fixed a priori, derived from external data, or optimized (even via tree structure) against the human judgments on the 1302 HowToBench instructions. This is load-bearing for the 0.93 correlation claim, as post-hoc fitting would make the result circular rather than evidence of principled bias mitigation.

Authors: The referee correctly identifies a critical omission in the original manuscript. The aggregation weights in ToW are fixed a priori, based on expert linguistic analysis and standard writing evaluation rubrics independent of the HowToBench human judgments. No optimization or fitting against the 1302 instructions occurred. We will revise Section 3 to explicitly describe the weight determination process, list the specific weights, and explain their grounding in domain knowledge to eliminate any ambiguity about circularity.

Revision: yes

Referee: Section 5 (Experiments): The textual disturbance robustness tests lack a detailed protocol, including the exact disturbances applied, number of variants per sample, and quantitative criteria for 'vulnerable' vs. 'robust.' Without this, the comparative claim that overlap-based metrics and LLM-as-a-judge are vulnerable while ToW is not cannot be evaluated.

Authors: We agree that the robustness experiment protocol is underspecified. The revised Section 5 will detail the exact disturbances (word substitutions, syntactic reordering, and insertion/deletion perturbations), the number of variants per sample (five per instance), and the quantitative criteria (Pearson correlation drop threshold of 0.15 to classify vulnerability). These additions will make the comparative claims fully evaluable and reproducible.

Revision: yes

Referee: Section 4 (HowToBench): The assertion that the benchmark supports broad claims about 'LLM's Capability in Human-level Writing' rests on 12 genres and three task categories, but the paper offers no external validation or comparison showing these are representative of human-level writing distributions.

Authors: We acknowledge that the manuscript provides no external validation of representativeness against larger human writing distributions. We will revise Section 4 to include an explicit rationale for the 12 genres and three task categories (drawn from prevalent real-world writing scenarios), add a limitations paragraph on the scope of representativeness, and moderate the breadth of claims accordingly.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; ToW correlation presented as independent evaluation result

The abstract states that ToW 'incorporates a tree-structured workflow by explicitly modeling the aggregation weights of sub-features' and 'successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments' on HowToBench. No equations, descriptions, or statements indicate that the weights are optimized, fitted, or derived from the same 1302 human judgments used to compute the reported correlation. The method is framed as a structural solution to LLM-as-a-judge inconsistency, with separate claims about robustness to textual disturbances. The benchmark itself is introduced as a new contribution. No self-citation chains, self-definitional steps, or fitted-input predictions are evident in the provided text. The derivation chain remains self-contained against external human judgments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that writing quality decomposes cleanly into sub-features whose weights can be modeled without circularity to human judgment; no new physical entities are postulated.

free parameters (1)

sub-feature aggregation weights
The tree workflow requires explicit weights for combining sub-feature scores; these are not stated as fixed constants and are therefore presumed fitted or tuned to match human data.

axioms (1)

domain assumption: LLM writing quality can be decomposed into independent sub-features that aggregate via a tree structure to match human judgment.
Invoked to justify why explicit tree modeling resolves the inconsistency of flat LLM-as-a-judge methods.

pith-pipeline@v0.9.0 · 5538 in / 1458 out tokens · 40804 ms · 2026-05-10T03:13:57.082930+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

166 extracted references · 3 canonical work pages · 1 internal anchor

[1]

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 11621–11640, Bangkok, Thailand

AlignBench: Benchmarking Chinese align- ment of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 11621–11640, Bangkok, Thailand. Association for Computational Linguistics. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-ev...

work page arXiv 2023
[2]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Direct preference optimization: Your lan- guage model is secretly a reward model. Preprint, arXiv:2305.18290. Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text genera- tion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Comp...

work page internal anchor Pith review arXiv 2020
[3]

Judgelm: Fine-tuned large language models are scalable judges,

Judgelm: Fine-tuned large language models are scalable judges. Preprint, arXiv:2310.17631. Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2024. Multilingual machine translation with large language models: Empirical results and analysis. In Findings of the Association for Computational Linguistics:...

work page arXiv 2024
[4]

The writings are all professionally written

Chinese Writer Website(CN Writer, 中国作家网) 3 : this site collects all publishable fic- tions, proses, poets from professional writers from China, powered by Chinese Association of Writer. The writings are all professionally written. The total number of raw data is ap- proximately 5k
[5]

The writings are of high quality and they serve as examples for learners

The pivot website for example essays (PW4ES, 第一范文网 ) 4 : this site collects numerous functional writing sources, such as contracts, plans, conclusions, thoughts, speeches and deliveries etc. The writings are of high quality and they serve as examples for learners. The total number of raw data is approximately 30k
[6]

The writings are of high quality and they serve as examples for learners

September for example essays(SeptES, 九月范文网) 5: this site complements to the above sites, with additional functional writings. The writings are of high quality and they serve as examples for learners. The total number of raw data is approximately 30k
[7]

These articles are targeting electronic self-media readers, and are written by profes- sional newspaper writers

Zhejiang Publicity(ZJPub, 浙江宣传 ) 6 : this site collects numerous argumentative, crit- ics targeting at social/historical/cultural af- fairs. These articles are targeting electronic self-media readers, and are written by profes- sional newspaper writers. The total number of raw data is approximately 10k
[8]

We purchased the articles from the site instead of crawling for its com- mercial use

Site for Officials(Officials, 公文网) 7: this site collects examples for official articles writ- ings, including propaganda, deliveries, an- nouncements, etc. We purchased the articles from the site instead of crawling for its com- mercial use. The articles are written by expert civil servants from the government, and is of high quality. The total number of ...
[9]

American Rhetoric8: This website records fa- mous speeches in American history, including historical speeches as well as parliamentary speeches and questions
[10]

Obook9: This website records numerous En- glish published books with a wide range of genres, including fiction, prose, poem, novel across 16 century to contemporary
[11]

Plot and Structure

IvyPanda10. This website serves top level example essays across 32 topics, includ- ing art, business, culture, environment, his- tory, music and so on. We use huggingface datasetqwedsacf/ivypanda-essays 11 from the same source and the number is approxi- mately 100K. A.2 Generalizability to English To validate the generalizability ofToW, we con- structed a...
[12]

Plot and Structure: Summarize the main content of the fiction in one sentence of no more than 100 words
[13]

Character Development: Describe the personalities, experiences, and relationships of the main characters in no more than 100 words per character, with a maximum of 5 characters in total
[14]

Theme and Message: Summarize the theme and message the fiction aims to convey in no more than 100 words
[15]

Language and Style: Describe the overall linguistic style of the fiction and the level of detail in its descriptions, in no more than 100 words
[16]

Emotional Resonance: Specify the type of emotional resonance the fiction aims to evoke in readers in no more than 100 words
[17]

guide generation,

Innovation and Originality: Describe how the fiction should demonstrate unique- ness or originality in no more than 100 words. Output the instructions using the following format: <Plot and Structure Start> xxxx <Plot and Structure End> <Character Development Start> xxxx <Character Development End> <Theme and Message Start> xxxx <Theme and Message End> <La...
[18]

Theme/Argument/Topic Fit: - How well does the writing address the objective of the instructions? - Are the arguments or ideas relevant and clearly aligned with the given topic? - Does the writing stay focused, or does it go off-topic?
[19]

Tone and Language: - Is the tone appropriate for the audience and purpose outlined in the writing instruction? - Does the writing use clear, engaging, and pro- fessional language where required? - Is the tone consistent throughout the piece?
[20]

Attractiveness of Opening and Profound Ending: - Does the writing start with a strong and engag- ing opening that catches the reader’s attention? - Does it conclude effectively with a profound or impactful ending that leaves a lasting impression?
[21]

Rhetoric, Logic, and Examples: - Does the writing employ effective rhetoric (e.g., persuasive techniques, vivid imagery, or strong analogies)? - Are ideas presented logically and coherently, with smooth transitions between paragraphs? - Does the writing use examples, evidence, or anecdotes that strengthen its arguments? G.3.2 Format
[22]

Basic Format Requirements of the Genre - Does the writing follow the structural conven- tions of the specified genre (e.g., essay, article, guide, etc.)? - Are any mandatory elements of the format (e.g., headings, bullet points, or lists) included and used appropriately? - Avoiding Abrupt Bullets or Unordered Lists : - Does the writing avoid disorganized ...
[23]

A failure to follow core requirements should result in a lower ranking

Adequate Titling and Subtitle Structures - Does the writing include an appropriate, engag- ing, and informative title? - If subtitles are required or used, are they logical, helpful, and aligned with the overall structure of the piece? G.3.3 Additional Considerations -Consistency with Instruction and Guiding In- formation Always double-check whether the w...
[24]

Clarity of the Theme and Argument Clarity of the Theme : Is the theme of the essay clear and prominent? Can readers quickly grasp the central idea? Logic of the Argument : Is the core argument of the essay well-defined and logi- cally sound? Does it effectively support the overall content?
[25]

Adequacy and Diversity of Evidence Adequacy of Evidence : Does the essay provide enough persuasive evidence? Is the evidence spe- cific, detailed, and closely related to the theme? Diversity of Evidence : Are the types of evidence varied (e.g., theoretical analysis, factual examples, data citations, expert opinions)? Does the evidence approach the theme f...
[26]

Language and Logical Expression Language Expression : Is the language of the essay concise, clear, and logical? Are the sentences coherent and easy to understand? Does the lan- guage enhance the essay’s persuasiveness? Clarity of Logic : Is the reasoning process rigorous and progressive, leading to strong and rational argu- ments?
[27]

introduction-body-conclusion

Structure and Writing Logic Structural Coherence : Is the structure of the es- say clear and well-organized? Does it follow a logi- cal format, such as "introduction-body-conclusion" or parallel argumentation? Consistency in Flow : Are the paragraphs cohesive and logically ar- ranged? Does the essay use effective transitions to strengthen the cohesiveness...
[28]

Reflectiveness and Innovation Depth of Reflection : Does the essay demon- strate some degree of reflection on societal, individ- ual, or universally relevant issues? Does it inspire deeper thinking in readers? Novelty of Perspective : Are the arguments innovative or distinctive? Does the essay present surprising or original viewpoints or methods of argume...
[29]

Goals and Depth of Reflection Clarity of Goals : Does the summary clearly articulate the specific objectives and plans of the work? Does it effectively review and analyze ac- cording to the established goals? Depth of Re- flection : Does the summary deeply reflect on the achievement of the goals? Does it extract mean- ingful lessons from successes or shor...
[30]

Content and Logic Comprehensiveness of Content : Does the sum- mary cover the key aspects of the work process? Does it address important outcomes, challenges, and areas for improvement in detail? Clarity of Logic : Is the content presented in a well-structured and logical manner? Is it organized by criteria such as timeline, importance, or category? Is it...
[31]

Language and Precision Conciseness of Expression : Is the summary written with precise and concise language? Is it effective in conveying information within a limited space? Persuasiveness of Language : Does the language inspire trust and resonance? Is it engag- ing and persuasive enough to capture the reader’s attention?
[32]

Structure and Readability Rationality of Structure : Is the structure of the summary clear and reasonable (e.g., having clear headings and well-distributed paragraphs)? Does it enhance the overall reading experience? Aesthetic Presentation : Does the summary use visual ele- ments like clear formatting, highlighted keywords, or data references to improve t...
[33]

Innovation in the Summary Uniqueness of Analytical Perspective : Does the summary demonstrate the author’s unique insights or thought-provoking analysis? Does it break away from traditional formats to showcase individual or team creativity? Foresight in Recommendations : Does the summary propose specific and forward- thinking suggestions or future plans? ...
[34]

Integrity and Clarity Clause Coverage : Do the contract provisions comprehensively address all necessary aspects, in- cluding the rights and obligations of both parties, liability for breach, and dispute resolution mech- anisms? Have important details been thoroughly included to avoid omissions? Language Clarity : Is the contract language concise and clea...
[35]

Legality and Risk Control Legal Compliance : Does the contract fully com- ply with relevant laws and regulations, including those related to the qualification of parties, juris- diction, and compensation mechanisms? Has the contract considered specific legal requirements in its respective field, such as labor laws or intellec- tual property laws? Risk Pre...
[36]

Practical Operability Execution Details : Does the contract provide detailed considerations for implementation, cover- ing specific aspects like payment methods, delivery standards, and service quality? Does it offer clear operational guidelines and responsibilities for the performance process? Performance Monitoring : Does the contract include provisions...
[37]

Balance and Fairness Equity Balance : Does the contract reasonably balance the rights and interests of both parties? Does it avoid obviously one-sided terms, such as unfair allocations of liability for breach or overly stringent conditions? Fairness of Design : Are the contract terms structured to reflect fairness and impartiality, effectively reducing th...
[38]

Future Adaptability and Sustainability Flexibility for Adjustment : Does the contract account for potential future changes in circum- stances, such as legal amendments or market fluc- tuations? Does it offer flexible provisions for mod- ifications or adjustments to address unforeseen de- velopments? Long-Term Cooperation Potential : Does the contract safe...
[39]

Linguistic Expression Clarity of Expression : Is the speech language clear, concise, devoid of redundancy, and easy to understand? Are grammar and syntax correct, with varied and layered sentence structures? Appropri- ateness of Language : Does the expression align with the demands of the occasion, employing a for- mal, humorous, or emotional style as nee...
[40]

Emotional Expression and Impact Sincerity of Emotion : Does the speech convey authentic and profound emotions, reflecting the speaker’s genuine attitude? Emotional Resonance : Does the content resonate with the audience, evoke emotional engagement, and fit the tone of different occasions?
[41]

Logical Structure and Coherence Structural Clarity : Is the speech well-structured, with a clear introduction, body, and conclusion? Are key points highlighted, and does the flow of ideas remain coherent? Natural Transitions : Are the transitions between sections logical and smooth, ensuring content flows naturally?
[42]

Suitability for the Occasion Relevance of Content : Does the speech align with the specific theme and atmosphere of the occa- sion (e.g., weddings, memorials)? Audience Con- sideration : Does the speech take into account the audience’s psychology and needs, with language and expression respectful of the context and cul- ture?
[43]

Creativity and Originality Unique Perspective : Does the speech reflect the speaker’s creativity or unique perspective, rather than relying entirely on conventional templates? Memorable Impressions : Are there innovative ex- pressions or distinctive personal elements that leave a lasting impression and highlight the speech’s in- dividuality? H.5 Documentary
[44]

Authenticity and Factual Accuracy Does the work accurately and faithfully reflect historical events or social phenomena, based on thorough investigation and research with reliable sources? Does the work present the complexity of events from multiple perspectives, avoiding bias while maintaining factual rigor?
[45]

Characterization and Emotional Expression Are the characters multidimensional and well- developed, reflecting their inner world and emo- tional changes convincingly? Are the relation- ships between characters intricate and dynamic, contributing to story development, and are the char- acters’ growth or transformations reasonable and compelling?
[46]

Structure and Narrative Techniques Is the overall narrative structure clear and log- ical? Are the plot and pacing engaging and well- balanced, avoiding excessive length or repetitive- ness? Does the work effectively use techniques such as nonlinear timelines, spatial transitions, or shifts in perspective and detail to enhance story- telling and literary quality?
[47]

Ideological Depth and Social Significance Does the work encourage readers to deeply re- flect on social phenomena, historical contexts, or human behaviors, demonstrating a strong sense of social concern? Does it display critical and re- flective perspectives, courageously exposing social issues and engaging in an in-depth exploration of history or society?
[48]

Language and Writing Style Is the language concise, clear, and expressive, employing techniques such as detail, metaphor, or description to enhance literary quality and emo- tional impact? Does the narrative style align with the theme and emotions of the work, enhancing its readability and artistic value? H.6 Essay

Argument and Depth of Thought
Core Argument: Does the review article present a clear and well-defined central argument or position? Does it effectively and directly address the topic or text in question?
Depth of Thought: Does the article demonstrate profound insight into the subject or material? Does it employ thorough analysis or critical thinking t...

Logic and Evidence
Clarity of Logic: Is the argument logically coherent? Is the article well-structured and organized, unfolding its analysis in a systematic and layered manner?
Quality of Evidence: Does the article provide strong evidence to support its central argument? Is the evidence thoroughly analyzed and interpreted in a persuasive way?

Language and Style
Language Precision: Is the language used accurate, concise, and persuasive? Does it reflect the analytical nature of commentary writing?
Distinctive Style: Does the writing style demonstrate critical thinking? Does it reflect the author’s depth of thought and an individualized approach to expression?

Perspective and Comprehensiveness
Multifaceted Analysis: Does the article analyze and interpret the topic or text from multiple perspectives, reflecting a comprehensive understanding of the issue?
Comprehensiveness: Does the review integrate various layers of analysis, presenting a holistic grasp of the subject matter?

Originality and Thought-Provocation
Originality: Does the article present unique insights or novel perspectives? Does it offer new ways of thinking or intellectual contributions to the discussion?
Thought-Provocation: Does the content of the review inspire further reflection or exploration by the reader? Does it open up new interpretative possibilitie...

Plot and Structure
Plot Coherence: Is the plot well-paced and engaging? Does it maintain the reader’s interest?
Structural Design: Is the structure of the novel logical? Are there instances of unnecessary delays or plot gaps? For medium- to long-length novels, a clear progression (beginning, development, turning points, climax, and resolution) is cr...

Characterization
Character Depth: Are the characters well-developed, multidimensional, and distinct in personality?
Character Development: Do the characters undergo meaningful growth, change, or conflict in a well-reasoned way? Are there clear internal struggles or character arcs?
Interpersonal Dynamics: Are the interactions between characte...

Themes and Ideas
Thematic Depth: Does the novel have a clear theme? Is the theme explored with sufficient depth and intellectual value?
Ideological Expression: Does the novel convey profound ideas through characters, plot, or symbols? Does it provoke critical thought?
Social and Cultural Context: Does the story reflect a nuanced understanding of a pa...

Language and Prose
Style of Expression: Is the author’s language vivid, elegant, and effective in portraying the emotions and thoughts of the characters?
Contextual Adaptation: Does the language align with the tone and atmosphere of the story? Does it enhance the emotional tension?
Detailing: Are the descriptions appropriate and well-crafted, contr...

Emotional Resonance
Emotional Impact: Does the novel evoke emotional resonance in readers? Does it foster empathy and emotional engagement?
Emotional Authenticity: Are the emotions in the story realistic and compelling? Do they effectively move the reader?

Innovation and Distinctiveness
Originality: Does the novel exhibit creativity or innovation by breaking away from conventional tropes or styles?
Unique Perspective: Does the novel present a distinct viewpoint or approach to exploring its subject matter? Does it convey a strong sense of identity and uniqueness?

H.8 Letters

Structure and Format
Does the letter follow standard formatting with appropriate salutation, body, and closing? Is the letter’s structure clear, with distinct paragraphs and a logical flow? Is the letter well-organized and visually appealing, making it easy to read?

Language Brevity and Clarity
Is the language in the letter concise, avoiding long and complex sentences? Is the expression clear, is the logic coherent, and is the information accurate? Are ambiguities and unclear statements avoided to ensure the recipient’s full understanding?

Tone and Attitude
Is the tone appropriately chosen based on the recipient’s identity and the letter’s purpose? Does the tone convey sincerity and respect? Does the letter maintain the necessary politeness and professionalism?

Clear Purpose and Accurate Content
Is the core purpose of the letter (e.g., request, notification, suggestion) clearly expressed? Is the content accurate and free from errors or ambiguous expressions? Does the letter stay focused on its goal without deviating from its theme?

Etiquette and Adaptability
Does the letter adhere to basic etiquette norms? Are the language and expression appropriate for the cultural context or situational needs? Is the overall visual presentation of the letter tidy, standardized, and easy to read?

H.9 Officials

Accuracy and Completeness of Content
Is the content of the document factual and accurate? Does it include all necessary information and details? Is there assurance that no critical parts are omitted? Does it comply with current laws, policies, and regulations?

Structure and Logical Flow
Is the structure of the document clear and reasonable? Is there a good logical connection between paragraphs? Is the sequence of information arranged logically? Does the content flow naturally without redundancy or confusion?

Language Standardization and Conciseness
Does the language conform to formal document standards? Are colloquial expressions avoided? Is the expression precise and rigorous? Is the language concise and clear, facilitating reader understanding and execution?

Formatting and Formality
Does the document follow standard formatting? Are sections like type, title, number, date, and signatory in compliance with requirements? Is the layout orderly, with correct punctuation and wording? Is the overall tone of the document formal and appropriate?

Executability and Legal Compliance
Does the document have clear executable directives? Are the proposed requirements and measures specific and actionable? Does the content comply with laws and regulations? Is there an assurance that it avoids any violations of law or public interest?

H.10 Plan

Clarity of Objectives
Core Objectives: Does the plan have clearly defined goals? Are the objectives measurable and achievable, effectively guiding execution?
Detailed Objectives: Does the plan outline problem-specific solutions with well-defined, quantifiable indicators (e.g., percentage of sales growth, training completion rate)?

Feasibility and Executability
Execution Details: Does the plan provide clear operational guidance and a complete implementation process? Are specific implementation steps, timelines, and responsibilities clearly outlined?
Execution Support: Does the plan account for key factors such as resources, personnel, and time during execution? Does it include...

Innovation and Differentiation
Unique Perspective: Does the plan break conventional approaches, offering fresh perspectives or solutions? Does it incorporate novel ideas, methods, or technological support?
Innovative Value: Compared to existing plans, does the new plan demonstrate differentiation, effectively addressing issues or offering breakthrough...

Risk Assessment and Mitigation Measures
Risk Identification: Does the plan identify potential risks and scenarios that could impact implementation?
Mitigation Strategies: Does the plan propose concrete measures or alternative strategies to manage identified risks? Does it account for adaptability in addressing different scenarios?

Effectiveness Evaluation and Feedback Mechanism
Evaluation Tools: Does the plan include a comprehensive assessment mechanism to monitor outcomes, provide regular feedback, or track results over time?
Optimization Capability: Does the plan incorporate mechanisms for adjustment and iteration based on practical feedback to ensure continuous improve...

Language and Expressiveness
Innovation and Simplicity: Modern poetry often emphasizes linguistic innovation and unique expressiveness. When evaluating, focus on whether the poem uses distinctive language and effectively conveys rich emotions or ideas succinctly.
Rhythm and Sound: Even without traditional rhymes, modern poetry enhances expression throu...

Theme and Depth of Thought
Philosophical and Reflective Qualities: Modern poems often explore profound themes such as individuality, society, and existence. Evaluation should assess whether the poem possesses philosophical or reflective qualities and whether it provokes thought in the reader.
Uniqueness of Theme and Presentation: Attention should be g...

Emotional Expression and Nuance
Sincerity and Complexity of Emotion: Modern poetry typically conveys emotions indirectly, using nuanced language, symbolism, and implications. Evaluation should consider the sincerity of the emotions and whether the emotions exhibit complexity or depth.
Integration of Emotion and Theme: Consider whether the emotional ex...

Uniqueness of Form and Structure
Innovative and Organic Structure: Modern poetry often features diverse structures, including fragmented or non-linear forms. Evaluation should note whether the poem’s structure is innovative and effectively supports its theme and emotional expression.
Unity of Form and Content: Modern poetry’s form typically complement...

Overall Effect and Ambiguity
Artistic Effect and Interpretative Space: Modern poetry often has openness and ambiguity. Evaluation should consider the poem’s overall effect: whether it resonates emotionally with the reader and stimulates diverse interpretations and reflections.
Impact and Intellectual Provocation: Ultimately, the evaluation of a mod...

Theme and Depth of Thought
Core Idea: Does the essay present a clear theme or central idea? Does it provoke readers to think deeply?
Depth of Thought: Does the essay explore profound philosophical, social, or life-related issues? Does it use detailed descriptions or personal experiences to convey broader reflections?
