Zero-shot Large Language Models for Automatic Readability Assessment

Riley Grossman; Yi Chen

arxiv: 2604.24470 · v1 · submitted 2026-04-27 · 💻 cs.CL

Pith reviewed 2026-05-08 03:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords: automatic readability assessment · zero-shot prompting · large language models · unsupervised methods · LAURAE · text complexity · natural language processing

The pith

Zero-shot prompting lets large language models assess text readability more accurately than prior unsupervised methods on 13 of 14 datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how large language models can judge the readability of text without any prior training on readability data. By using a specific zero-shot prompting method, the models outperform earlier unsupervised techniques on nearly all tested datasets. The authors also introduce LAURAE, a system that blends LLM judgments with traditional readability formulas to better account for both context and basic text features like length. This matters for creating suitable materials in education and medicine, where mismatched readability can hinder understanding. The approach suggests that general-purpose AI models can handle specialized assessment tasks directly.

Core claim

The paper establishes that zero-shot prompting enables large language models to perform automatic readability assessment effectively, surpassing prior methods on 13 of 14 diverse datasets spanning different languages and text lengths. It further demonstrates that ensembling LLM scores with readability formula scores in LAURAE enhances robustness by integrating contextual understanding with shallow linguistic features.

What carries the argument

The zero-shot prompting methodology that elicits readability assessments from LLMs, and LAURAE as the hybrid combination of LLM and formula scores.
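
A minimal sketch of how such a zero-shot query could be assembled and its reply parsed. Only the 1 to 9 scale (1 very easy, 9 very difficult) and the reply format `Answer: [SCORE] Confidence: [Confidence Score] Explanation: [EXPLANATION]` come from the appendix fragment captured on this page. The instruction wording and the names `build_prompt`, `parse_rating`, and `Rating` are hypothetical. The confidence value is parsed as a plain number because its scale is not stated here.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical wording: only the 1-9 scale and the answer format are taken
# from the extracted appendix fragment, not the instruction text itself.
PROMPT_TEMPLATE = (
    "Rate the readability of the text below on a scale from 1 to 9, "
    "where 1 means very easy to understand and 9 means very difficult "
    "to understand.\n\n"
    "Text: {text}\n\n"
    "Answer with this format: Answer: [SCORE] Confidence: "
    "[Confidence Score] Explanation: [EXPLANATION]"
)


class Rating(NamedTuple):
    score: int
    confidence: Optional[float]
    explanation: str


# Confidence and Explanation are optional so a terse reply still parses.
_PATTERN = re.compile(
    r"Answer:\s*\[?(\d+)\]?\s*"
    r"(?:Confidence:\s*\[?([0-9]*\.?[0-9]+)\]?\s*)?"
    r"(?:Explanation:\s*(.*))?",
    re.DOTALL,
)


def build_prompt(text: str) -> str:
    """Fill the (assumed) template with the passage to be rated."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_rating(reply: str) -> Optional[Rating]:
    """Return the parsed rating, or None if no in-range score is found."""
    match = _PATTERN.search(reply)
    if match is None:
        return None
    score = int(match.group(1))
    if not 1 <= score <= 9:
        return None
    confidence = float(match.group(2)) if match.group(2) else None
    explanation = (match.group(3) or "").strip()
    return Rating(score, confidence, explanation)
```

Returning `None` for unparseable or out-of-range replies keeps malformed generations out of downstream scores instead of silently coercing them.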

If this is right

The method allows unsupervised readability assessment without domain-specific training data.
LAURAE improves performance by capturing both deep semantic and surface-level text properties.
Results generalize across languages, text lengths, and levels of technical content.
Open-source LLMs of varying sizes can serve as effective readability evaluators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could enable real-time readability feedback in writing assistants without custom models.
Future work might test the method on emerging LLM architectures or multimodal texts.
Potential limitations in handling highly specialized jargon could be addressed by domain-specific prompts.

Load-bearing premise

That LLM zero-shot outputs reliably indicate actual readability levels instead of reflecting training artifacts, prompt design, or random variations.

What would settle it

A new dataset from a domain not represented in the 14 tested ones where the proposed method fails to outperform baselines, or where human expert ratings diverge from both LLM and hybrid scores.

Figures

Figures reproduced from arXiv: 2604.24470 by Riley Grossman, Yi Chen.

**Figure 1.** Proposed Zero-shot LLM Method versus Unsupervised ARA Baselines on Benchmark Datasets (Note: ...)

**Figure 2.** LAURAE Ablation Study Results. In ...

**Figure 3.** LAURAE Ablation Study for Shallow Unsupervised ARA Method

D Selecting PLM for RSRS. For the RSRS method (Martinc et al., 2021), we test three variants of RSRS. We implement RSRS with two multilingual BERT-based models: mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020). We also choose one monolingual BERT-based model for each dataset depending on the language: BERT (Devlin et al., 2019) for ...

**Figure 4.** Distribution of Verbal Confidence Scores by ...

Original abstract

Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Zero-shot LLMs beat prior unsupervised methods on 13/14 datasets and the LAURAE hybrid adds robustness, but missing prompt details and contamination checks limit how much we can trust the gains.

The main point is that their zero-shot prompting method gets LLMs to outperform prior unsupervised methods on 13 of the 14 datasets, and the LAURAE hybrid that mixes LLM scores with formula scores holds up better across languages, lengths, and technical content. They ran the tests on ten open-source models of different sizes and origins, which gives the comparison more weight than the usual narrow evaluations in this area. The hybrid idea itself is straightforward and practical: LLMs pick up context while formulas catch surface signals like sentence length. That combination is the clearest step forward here.

The evaluation scope is the strongest part. Covering 14 datasets that vary in domain and language shows the method is not tuned to one narrow setting, and the claim of robustness across those axes is worth noting.

The soft spots are real but not fatal. The abstract and available text give no example prompt, no description of how they averaged or stabilized multiple LLM runs, and no membership or cutoff checks on the datasets. Public readability corpora predate most current LLMs, so leakage remains possible and could explain part of the edge. They also skip statistical significance or variance numbers, so the size of the gains is hard to judge.

This is aimed at applied people who need unsupervised tools for educational or medical text screening. A reader working on accessibility or ed-tech pipelines will find the multi-dataset numbers useful even if they have to re-implement the prompts themselves. The work shows clear engagement with prior formulas and a practical goal, so it deserves a serious referee. Reviewers will ask for the prompt template, stability tests, and contamination controls, but the breadth of the experiments makes it worth sending out.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a zero-shot prompting methodology for unsupervised automatic readability assessment (ARA) and evaluates 10 open-source LLMs across 14 diverse datasets (varying in length, language, and technical content). It reports that the proposed prompting outperforms prior methods on 13 of the 14 datasets and introduces LAURAE, a hybrid that combines LLM zero-shot scores with traditional readability formulas to capture both contextual and shallow features, claiming improved robustness.

Significance. If the empirical claims are substantiated with full methodological details, the work would offer a practical advance in unsupervised ARA by showing that LLMs can provide contextual signals beyond formula-based approaches, with direct utility for educational and medical text adaptation. The scale of the evaluation (10 models, 14 datasets) is a clear strength that supports generalizability claims across languages and domains.

major comments (2)

[Abstract and Evaluation section] The headline claim of outperformance on 13/14 datasets is presented without reported prompt templates, exact evaluation metrics (e.g., Pearson/Spearman correlation values), statistical significance tests, or variance across multiple LLM sampling seeds/runs. This directly affects verifiability of the robustness and superiority assertions.
[Dataset and LLM sections] No membership inference, cutoff-date controls, or contamination checks are described for the 14 public corpora (whose texts predate the evaluated LLMs). This is load-bearing for the central claim that zero-shot outputs measure genuine readability features of the supplied text rather than training-data artifacts.

minor comments (2)

[Abstract] The abstract refers to 'our proposed prompting methodology' without a concise statement of its distinguishing features relative to standard zero-shot templates.
[LAURAE Description] LAURAE combination rule (e.g., weighting scheme, whether weights are fixed or tuned) should be stated explicitly with an equation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions planned to improve the manuscript.

Referee: [Abstract and Evaluation section] The headline claim of outperformance on 13/14 datasets is presented without reported prompt templates, exact evaluation metrics (e.g., Pearson/Spearman correlation values), statistical significance tests, or variance across multiple LLM sampling seeds/runs. This directly affects verifiability of the robustness and superiority assertions.

Authors: We appreciate the emphasis on transparency. The manuscript provides the full prompt templates in the appendix and reports Pearson and Spearman correlations in the evaluation tables. To strengthen verifiability as requested, we will revise the abstract to include key metric values, add explicit statistical significance tests, and report variance across multiple sampling runs in the main evaluation section. These changes will be incorporated without altering the underlying results.
Revision: yes

Referee: [Dataset and LLM sections] No membership inference, cutoff-date controls, or contamination checks are described for the 14 public corpora (whose texts predate the evaluated LLMs). This is load-bearing for the central claim that zero-shot outputs measure genuine readability features of the supplied text rather than training-data artifacts.

Authors: This concern is well-taken for interpreting zero-shot claims. The manuscript does not include explicit membership inference or cutoff controls. We will add a discussion of potential contamination risks and their mitigation through model and dataset diversity in a new limitations section, while noting that comprehensive checks lie beyond the current scope. This partial revision will clarify the claim boundaries without requiring new experiments.
Revision: partial
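
The run-to-run variance raised in the first exchange could be summarized per text roughly as follows. This is an illustrative sketch only: the function name `summarize_runs` and the input layout (a mapping from a text identifier to the scores from repeated sampling runs) are assumptions, not the paper's procedure.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(runs: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each text id to (mean score, sample standard deviation).

    `runs` holds the scores one model produced for the same text over
    repeated sampling runs. A single run has no spread, so 0.0 is reported.
    """
    summary: Dict[str, Tuple[float, float]] = {}
    for text_id, scores in runs.items():
        if not scores:
            raise ValueError(f"no scores recorded for {text_id!r}")
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[text_id] = (mean(scores), spread)
    return summary
```

A large spread for a given text signals that a single sampled score for it should not be trusted on its own.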

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external datasets

The paper presents an empirical study applying zero-shot prompting to 10 LLMs across 14 datasets, reporting outperformance on 13/14 and gains from the LAURAE ensemble. No equations, derivations, fitted parameters, or first-principles predictions appear in the claims. Performance is measured by direct comparison to ground-truth labels and prior methods on held-out data. No self-definitional steps, load-bearing self-citations, or reductions of results to inputs by construction are present. The methodology and LAURAE combination are evaluated externally without tautological elements.
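
Comparison against ground-truth labels in this kind of evaluation is commonly done with rank correlation. The sketch below computes Spearman's coefficient in plain Python as the Pearson correlation of average ranks. It is a generic illustration under that assumption, not the paper's evaluation code, and the helper names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Spearman rank correlation between predicted scores and gold labels."""
    return pearson(average_ranks(predicted), average_ranks(gold))
```

Rank correlation only requires that predicted scores order texts the same way the labels do, which suits a setting where the model's scale and the dataset's label scale differ.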

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that LLMs can produce meaningful readability judgments from prompts alone and that combining them with surface formulas adds robustness; no free parameters or invented physical entities are introduced.

axioms (1)

Domain assumption: LLMs possess sufficient linguistic knowledge to assess readability in a zero-shot setting when given an appropriate prompt.
Invoked by the proposal of the prompting methodology and the claim of outperformance without task-specific fine-tuning.

invented entities (1)

LAURAE (no independent evidence)
Purpose: Hybrid readability scoring function that fuses LLM output with traditional formula scores.
Newly proposed combination method intended to capture both contextual and shallow features.
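
The combination rule is not stated on this page, so the following is only one plausible shape for such a hybrid, not LAURAE itself. It assumes (hypothetically) that both score lists are standardized to z-scores and averaged with a fixed weight. The formula side uses the Flesch-Kincaid grade level, which applies to English only, with a crude vowel-group syllable counter.

```python
import re
from statistics import mean, pstdev
from typing import List, Sequence


def count_syllables(word: str) -> int:
    """Crude heuristic: number of vowel groups, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level (English), using the heuristic above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize values; a constant list maps to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def ensemble_scores(llm_scores: Sequence[float],
                    formula_scores: Sequence[float],
                    llm_weight: float = 0.5) -> List[float]:
    """Assumed rule: weighted mean of z-scored LLM and formula scores."""
    if len(llm_scores) != len(formula_scores):
        raise ValueError("score lists must have the same length")
    z_llm, z_formula = z_scores(llm_scores), z_scores(formula_scores)
    return [llm_weight * a + (1 - llm_weight) * b
            for a, b in zip(z_llm, z_formula)]
```

Standardizing first keeps a 1 to 9 rating and a grade-level estimate on comparable footing before they are mixed. Whether the weight is fixed or tuned in the paper is exactly what the referee asks the authors to state.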

pith-pipeline@v0.9.0 · 5440 in / 1356 out tokens · 43448 ms · 2026-05-08T03:42:57.573938+00:00 · methodology

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1]

DeepSeek-AI, D

Examining linguistic shifts in academic writing before and after the launch of ChatGPT: a study on preprint papers. Scientometrics, 130(7):3597–3627. Prabin Bhandari and Hannah Brennan. 2023. Trustworthiness of children stories generated by large language models. In Proceedings of the 16th International Natural Language Generation Conference, pages ...

work page arXiv 2023
[2]

Association for Computational Linguistics

In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 126–134, Miami, Florida, USA. Association for Computational Linguistics. Emre Uysal. 2025. Evaluation of the readability level of the package inserts of topical antifungal drugs. Scientific Reports, 15(1). Sowmya Vajjala and Ivana Lučić. 2018....

work page 2024
[3]

In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 12–22

Text readability assessment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 12–22. Xiaoxu Zhang, Bo Wang, and Guangze Liu. 2025. Readability of financial reports and stock price crash risk. Finance Research Letters, 86:108489. Kaitlyn Zhou, Dan Jurafsky, and Tatsuno...

work page 2025
[4]

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5506–5524, Singapore

Navigating the grey area: How expressions of uncertainty and overconfidence affect language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5506–5524, Singapore. Association for Computational Linguistics. Shuqin Zhu, Man Zhang, and Dongdong Guo. 2024. Automatic prediction of text readability for in...

work page arXiv 2023
[5]

Answer with this format: Answer: [SCORE] Confidence: [Confidence Score] Explanation: [EXPLANATION]. A.2 Arbitrary Readability Scale Prompts All datasets are prompted to generate a readability score based on an arbitrary scale where 1 indicates the text is very easy to understand and 9 indicates the text is very difficult to understand. Several of the be...

work page 2016
