Zero-shot Large Language Models for Automatic Readability Assessment
Pith reviewed 2026-05-08 03:42 UTC · model grok-4.3
The pith
Zero-shot prompting lets large language models assess text readability more accurately than prior unsupervised methods on 13 of 14 datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that zero-shot prompting enables large language models to perform automatic readability assessment effectively, surpassing prior methods on 13 of 14 diverse datasets spanning different languages and text lengths. It further demonstrates that ensembling LLM scores with readability formula scores in LAURAE enhances robustness by integrating contextual understanding with shallow linguistic features.
What carries the argument
The zero-shot prompting methodology that elicits readability assessments from LLMs, and LAURAE as the hybrid combination of LLM and formula scores.
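One way the two signals might be combined can be sketched as follows. This is a minimal illustration, not the paper's rule: the combination LAURAE actually uses is not stated on this page, so the equal-weight average of z-scores is an assumption, as are the names `mean_sentence_length`, `zscores`, and `ensemble_scores`. Mean sentence length stands in for a full readability formula.

```python
import re
from statistics import mean, pstdev


def mean_sentence_length(text: str) -> float:
    """Shallow feature: average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return float(mean(len(s.split()) for s in sentences))


def zscores(values: list[float]) -> list[float]:
    """Standardize scores so that both signals share a common scale."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def ensemble_scores(llm_scores: list[float], texts: list[str]) -> list[float]:
    """Hypothetical equal-weight combination (assumed, not LAURAE's actual rule)."""
    shallow = [mean_sentence_length(t) for t in texts]
    return [(a + b) / 2 for a, b in zip(zscores(llm_scores), zscores(shallow))]
```

Standardizing before averaging keeps either signal from dominating purely because of its numeric range. Whether the weights should be fixed or tuned is left open here.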
If this is right
- The method allows unsupervised readability assessment without domain-specific training data.
- LAURAE improves performance by capturing both deep semantic and surface-level text properties.
- Results generalize across languages, text lengths, and levels of technical content.
- Open-source LLMs of varying sizes can serve as effective readability evaluators.
Where Pith is reading between the lines
- This could enable real-time readability feedback in writing assistants without custom models.
- Future work might test the method on emerging LLM architectures or multimodal texts.
- Potential limitations in handling highly specialized jargon could be addressed by domain-specific prompts.
Load-bearing premise
That LLM zero-shot outputs reliably indicate actual readability levels instead of reflecting training artifacts, prompt design, or random variations.
What would settle it
A new dataset from a domain not represented in the 14 tested ones where the proposed method fails to outperform baselines, or where human expert ratings diverge from both LLM and hybrid scores.
Abstract
Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.
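A zero-shot scoring step of the kind the abstract describes might look like the sketch below. The template wording, the 1 to 9 scale, and the `Answer: / Confidence: / Explanation:` reply format are treated as assumptions about the prompt design, and `build_prompt` and `parse_score` are hypothetical helper names. No model call is included, since the reply would come from whichever LLM is under evaluation.

```python
import re
from typing import Optional

# Illustrative template only; not the authors' exact prompt.
PROMPT_TEMPLATE = (
    "Rate how difficult the following text is to understand on a scale from "
    "{low} (very easy) to {high} (very difficult).\n"
    "Answer with this format: Answer: [SCORE] Confidence: [Confidence Score] "
    "Explanation: [EXPLANATION]\n\n"
    "Text: {text}"
)


def build_prompt(text: str, low: int = 1, high: int = 9) -> str:
    """Fill the template with the text to be rated and the scale bounds."""
    return PROMPT_TEMPLATE.format(low=low, high=high, text=text)


def parse_score(reply: str, low: int = 1, high: int = 9) -> Optional[int]:
    """Extract the integer after 'Answer:'; return None if absent or out of range."""
    match = re.search(r"Answer:\s*\[?(\d+)\]?", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None
```

Returning `None` for malformed or out-of-range replies makes parsing failures visible, so they can be counted separately instead of being silently scored.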
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a zero-shot prompting methodology for unsupervised automatic readability assessment (ARA) and evaluates 10 open-source LLMs across 14 diverse datasets (varying in length, language, and technical content). It reports that the proposed prompting outperforms prior methods on 13 of the 14 datasets and introduces LAURAE, a hybrid that combines LLM zero-shot scores with traditional readability formulas to capture both contextual and shallow features, claiming improved robustness.
Significance. If the empirical claims are substantiated with full methodological details, the work would offer a practical advance in unsupervised ARA by showing that LLMs can provide contextual signals beyond formula-based approaches, with direct utility for educational and medical text adaptation. The scale of the evaluation (10 models, 14 datasets) is a clear strength that supports generalizability claims across languages and domains.
major comments (2)
- [Abstract and Evaluation section] The headline claim of outperformance on 13/14 datasets is presented without reported prompt templates, exact evaluation metrics (e.g., Pearson/Spearman correlation values), statistical significance tests, or variance across multiple LLM sampling seeds/runs. This directly affects verifiability of the robustness and superiority assertions.
- [Dataset and LLM sections] No membership inference, cutoff-date controls, or contamination checks are described for the 14 public corpora (whose texts predate the evaluated LLMs). This is load-bearing for the central claim that zero-shot outputs measure genuine readability features of the supplied text rather than training-data artifacts.
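The rank correlation and run-to-run variance that the first major comment asks for can be computed along the lines below. This is a dependency-free sketch under assumed names (`average_ranks`, `pearson`, `spearman`, `summarize_runs`). It is not the paper's evaluation code, and the inputs are placeholders for predicted scores and ground-truth labels.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def summarize_runs(runs, labels):
    """Mean and population std. dev. of Spearman correlation over sampling runs."""
    rhos = [spearman(run, labels) for run in runs]
    return mean(rhos), pstdev(rhos)
```

Reporting the spread alongside the mean shows whether a method's margin over a baseline exceeds its own seed-to-seed noise.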
minor comments (2)
- [Abstract] The abstract refers to 'our proposed prompting methodology' without a concise statement of its distinguishing features relative to standard zero-shot templates.
- [LAURAE Description] The LAURAE combination rule (e.g., the weighting scheme and whether weights are fixed or tuned) should be stated explicitly with an equation or pseudocode.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions planned to improve the manuscript.
- Referee: [Abstract and Evaluation section] The headline claim of outperformance on 13/14 datasets is presented without reported prompt templates, exact evaluation metrics (e.g., Pearson/Spearman correlation values), statistical significance tests, or variance across multiple LLM sampling seeds/runs. This directly affects verifiability of the robustness and superiority assertions.
  Authors: We appreciate the emphasis on transparency. The manuscript provides the full prompt templates in the appendix and reports Pearson and Spearman correlations in the evaluation tables. To strengthen verifiability as requested, we will revise the abstract to include key metric values, add explicit statistical significance tests, and report variance across multiple sampling runs in the main evaluation section. These changes will be incorporated without altering the underlying results.
  Revision: yes
- Referee: [Dataset and LLM sections] No membership inference, cutoff-date controls, or contamination checks are described for the 14 public corpora (whose texts predate the evaluated LLMs). This is load-bearing for the central claim that zero-shot outputs measure genuine readability features of the supplied text rather than training-data artifacts.
  Authors: This concern is well-taken for interpreting zero-shot claims. The manuscript does not include explicit membership inference or cutoff controls. We will add a discussion of potential contamination risks and their mitigation through model and dataset diversity in a new limitations section, while noting that comprehensive checks lie beyond the current scope. This partial revision will clarify the claim boundaries without requiring new experiments.
  Revision: partial
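As a rough illustration of what an explicit significance test on the headline count could look like, the sketch below applies an exact two-sided sign test to 13 wins out of 14 datasets. It is an assumed analysis, not one reported in the manuscript: it treats datasets as independent trials with a 50% win probability under the null, and it ignores the size of each margin. The function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided sign test p-value under a 50% per-dataset win rate."""
    if not 0 <= wins <= n:
        raise ValueError("wins must lie between 0 and n")
    upper = sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n  # P(X >= wins)
    lower = sum(comb(n, i) for i in range(0, wins + 1)) / 2 ** n  # P(X <= wins)
    return min(1.0, 2 * min(upper, lower))


if __name__ == "__main__":
    # 13 of 14 datasets: upper tail is (14 + 1) / 2**14, doubled for two sides.
    print(sign_test_p_value(13, 14))
```

For 13 of 14 the p-value is about 0.0018. A per-dataset test on the correlation differences would still be needed to show that each individual margin is reliable.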
Circularity Check
No significant circularity; empirical evaluation on external datasets
The paper presents an empirical study applying zero-shot prompting to 10 LLMs across 14 datasets, reporting outperformance on 13/14 and gains from the LAURAE ensemble. No equations, derivations, fitted parameters, or first-principles predictions appear in the claims. Performance is measured by direct comparison to ground-truth labels and prior methods on held-out data. No self-definitional steps, load-bearing self-citations, or reductions of results to inputs by construction are present. The methodology and LAURAE combination are evaluated externally without tautological elements.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs possess sufficient linguistic knowledge to assess readability in a zero-shot setting when given an appropriate prompt.
invented entities (1)
- LAURAE: no independent evidence