Question Type, Cognitive Load, and CEFR Alignment: Evaluating LLM-Generated EFL Grammar Drill Exercises

Brendan Flanagan; Hiroaki Ogata; Steve Woollaston; Yuko Toyokawa

arxiv: 2606.01592 · v2 · pith:E65FIPFRnew · submitted 2026-06-01 · 💻 cs.CY


Pith reviewed 2026-06-28 12:44 UTC · model grok-4.3


keywords: LLM-generated content · EFL grammar drills · cognitive load · question modalities · CEFR-J · learner performance logs · active recall


The pith

Log data from Japanese junior high school students show that multiple-choice grammar questions impose the least cognitive load, cloze tasks pose the greatest barrier to active recall, and drag-and-drop exercises cost the most time, while difficulty rises in step with CEFR-J level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLM-generated EFL grammar drills function as usable classroom material by examining performance logs from Japanese junior high students. It identifies a performance hierarchy across question formats and checks whether the CEFR-J grammar tiers predict the observed accuracy and timing patterns. A sympathetic reader would care because the results indicate LLMs can produce drill content and that modality choice affects the shift from recognition to production. The work therefore supplies empirical guidance on how to arrange question types in sequence.

Core claim

Utilising log data from Japanese junior high school students practicing on a grammar drilling application, the study finds that multiple-choice questions carried the lowest cognitive load, cloze tasks posed the greatest barrier to active recall, and drag-and-drop exercises incurred the heaviest time penalties. Learner data further validated the CEFR-J grammar framework by showing a steady decline in accuracy and increased response times as proficiency levels advanced. These outcomes demonstrate that LLMs can generate viable learning content while underscoring the need to sequence modalities strategically.

What carries the argument

Log data from the grammar drilling application, which record accuracy and response time across question modalities and CEFR-J levels and from which cognitive load is inferred.
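
To make the shape of this evidence concrete, here is a minimal sketch of how per-attempt log records might be summarised by question modality. The record fields (modality, correct, seconds) and the sample rows are hypothetical illustrations, not the paper's schema or data.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical log rows: (modality, answered correctly?, response time in seconds).
# These values are invented for illustration and are not the paper's data.
LOGS = [
    ("mcq", True, 4.0), ("mcq", True, 5.0), ("mcq", False, 6.0),
    ("cloze", False, 9.0), ("cloze", True, 8.0), ("cloze", False, 10.0),
    ("drag_drop", True, 14.0), ("drag_drop", False, 16.0), ("drag_drop", True, 15.0),
]

def summarise(logs):
    """Return {modality: (accuracy, median response time)} for the given rows."""
    grouped = defaultdict(list)
    for modality, correct, seconds in logs:
        grouped[modality].append((correct, seconds))
    return {
        m: (mean(1.0 if c else 0.0 for c, _ in rows), median(s for _, s in rows))
        for m, rows in grouped.items()
    }

summary = summarise(LOGS)
for modality, (accuracy, med_time) in summary.items():
    print(f"{modality}: accuracy={accuracy:.2f}, median_time={med_time:.1f}s")
```

Accuracy and response time are the only quantities such logs measure directly; any statement about cognitive load is an interpretation layered on top of them.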

If this is right

LLMs can produce pedagogically usable EFL grammar drill content.
Question modalities should be ordered from lowest to highest load to move learners toward active production.
CEFR-J grammar tiers provide a reliable predictor of observed student performance.
Developers gain concrete criteria for selecting and sequencing exercise formats in AI-assisted language tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar log-based comparisons could test whether the same modality hierarchy appears in other languages or age groups.
If interface elements were varied independently, future work could separate modality effects from UI effects more cleanly.
The validated CEFR-J alignment suggests the framework could guide difficulty calibration in other LLM-generated language tasks.

Load-bearing premise

The application's log data isolates cognitive-load differences caused by question modality rather than by interface design, prior exposure, or other factors, and CEFR-J tiers serve as the right benchmark for empirical difficulty.

What would settle it

The reported hierarchy would be falsified by finding no measurable difference in accuracy or response time across the three question types; the CEFR-J validation would be falsified if accuracy did not decline, or response time did not rise, with advancing CEFR-J levels.
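
The trend half of this test can be sketched as a simple monotonicity check over per-level summaries. The level labels follow CEFR-J naming, but the numbers below are invented for illustration, and the check is purely descriptive: it says nothing about statistical significance.

```python
# Hypothetical per-level summaries, ordered from lowest to highest CEFR-J level.
# The numbers are invented for illustration and are not the paper's results.
LEVELS = ["A1.1", "A1.2", "A1.3", "A2.1"]
ACCURACY = [0.86, 0.79, 0.71, 0.64]
MEAN_SECONDS = [6.2, 7.5, 8.9, 10.4]

def strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))

def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

def trend_supports_claim(accuracy, seconds):
    """True only if accuracy falls and response time rises at every step."""
    return strictly_decreasing(accuracy) and strictly_increasing(seconds)

print(trend_supports_claim(ACCURACY, MEAN_SECONDS))
```

A single reversal between adjacent levels makes the check fail, which is a stricter reading of "steady decline" than a fitted trend line would be.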

Original abstract

This study evaluates the pedagogical viability of LLM-generated English as a Foreign Language (EFL) learning content. Utilising log data from Japanese junior high school students practicing on a grammar drilling application, we analysed how different question modalities impact student performance and whether theoretical localised CEFR difficulty tiers accurately predict empirical task difficulty. Results reveal a clear performance hierarchy: multiple-choice questions carried the lowest cognitive load, cloze tasks posed the greatest barrier to active recall, and drag-and-drop exercises incurred the heaviest time penalties. Furthermore, learner data validated the CEFR-J grammar framework, showing a steady decline in accuracy and increased response times as proficiency levels advanced. These findings demonstrate that LLMs can successfully generate learning content, while highlighting the need for developers to strategically sequence question modalities to transition learners from passive recognition to active linguistic production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New student log data on LLM-generated EFL drills shows question-type patterns but the cognitive-load hierarchy looks vulnerable to interface and exposure confounds.


The main thing to know is that this paper reports fresh log data from Japanese junior high students using an app with LLM-generated grammar exercises. It finds multiple-choice questions easiest on accuracy, cloze tasks hardest for recall, drag-and-drop slowest on time, and a steady drop in performance as CEFR-J levels rise.

What it does is apply existing ideas about question modalities and CEFR tiers to actual usage logs instead of theory or small pilots. That gives a practical data point on whether LLMs can produce usable drill content and how modalities might be sequenced.

The soft spots sit in the interpretation of the logs. Accuracy and time differences could come from pointer precision in drag-and-drop, uneven prior exposure in the app, or classroom effects rather than pure cognitive demand. The abstract gives no sign of randomization, counterbalancing, or regression controls for those factors, and it omits sample size and any statistical tests. If the full methods section does not address this, the claimed hierarchy and CEFR validation rest on weaker footing than presented.

This is for edtech developers working on LLM language apps and researchers using CEFR-J in school settings. A reader focused on practical drill design might extract usable signals.

It deserves peer review so the data handling and controls can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper evaluates LLM-generated EFL grammar drill exercises using log data from Japanese junior high school students on a grammar app. It claims a performance hierarchy across modalities (MCQ lowest cognitive load, cloze highest active-recall barrier, drag-and-drop highest time penalties) and validates the CEFR-J grammar framework via observed declines in accuracy and increases in response time at higher proficiency tiers. The work concludes that LLMs can generate viable content and that modalities should be sequenced to move learners toward production.

Significance. If the hierarchy and validation claims hold after addressing confounds and adding statistical controls, the results would offer practical guidance for sequencing question types in LLM-driven language tools and provide empirical support for localized CEFR frameworks. The study addresses a timely intersection of generative AI and language pedagogy, but the current evidence base is too preliminary for strong claims about cognitive load or framework validation.

major comments (2)

[Abstract] The performance hierarchy (MCQ lowest load, cloze highest recall barrier, drag-and-drop highest time cost) is asserted without reported sample size, statistical tests, confidence intervals, or regression controls for interface variables (e.g., pointer precision, gesture count) or learner history; this leaves open the possibility that observed differences reflect UI friction or differential prior exposure rather than modality-specific cognitive demands.
[Abstract] The CEFR-J validation claim (steady decline in accuracy and rise in response time with advancing proficiency) assumes the framework tiers are the appropriate external benchmark, yet provides no description of how learner levels were assigned, no calibration against the observed data distribution, and no comparison to alternative difficulty models.

minor comments (1)

The abstract would be strengthened by explicitly stating the number of students, items, and sessions analyzed, as well as any randomization or counterbalancing procedures used.
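
The confound concern in the major comments can be illustrated with a small regression sketch: response time is modelled on a modality indicator plus an interface covariate, so the modality coefficient is estimated with the covariate held fixed. The variable names (is_drag_drop, gesture_count), the synthetic rows, and the choice of ordinary least squares are assumptions for illustration; the paper's actual model is not described in the abstract.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients for y ~ intercept + the columns of rows."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic rows: (is_drag_drop, gesture_count). Times are constructed as
# 5 + 0 * is_drag_drop + 2 * gesture_count, so all extra time comes from gestures.
ROWS = [(0, 1), (0, 2), (0, 1), (1, 4), (1, 5), (1, 3), (0, 3), (1, 2)]
TIMES = [5 + 2 * g for _, g in ROWS]

intercept, modality_effect, gesture_effect = ols(ROWS, TIMES)
print(round(intercept, 6), round(modality_effect, 6), round(gesture_effect, 6))
```

In this synthetic case a naive comparison of group means would show drag-and-drop as slower (12 seconds against 8.5), yet the modality coefficient is approximately zero once gesture count is included. That is the pattern the referee asks the authors to rule out.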

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate additional statistical details, controls, and clarifications as appropriate.


Referee: [Abstract] The performance hierarchy (MCQ lowest load, cloze highest recall barrier, drag-and-drop highest time cost) is asserted without reported sample size, statistical tests, confidence intervals, or regression controls for interface variables (e.g., pointer precision, gesture count) or learner history; this leaves open the possibility that observed differences reflect UI friction or differential prior exposure rather than modality-specific cognitive demands.

Authors: The abstract functions as a high-level summary under strict length constraints, with full details on sample size (reported in Section 3.1), statistical tests, and confidence intervals provided in Sections 4.1–4.2 of the manuscript. We agree that explicit controls for potential confounds strengthen the interpretation. In the revision we have added a multiple regression model that controls for interface variables (pointer precision, gesture count) and learner history; the modality hierarchy remains statistically significant after these controls. The abstract has been updated to reference the sample size and the use of regression controls.

Revision: yes

Referee: [Abstract] The CEFR-J validation claim (steady decline in accuracy and rise in response time with advancing proficiency) assumes the framework tiers are the appropriate external benchmark, yet provides no description of how learner levels were assigned, no calibration against the observed data distribution, and no comparison to alternative difficulty models.

Authors: Learner proficiency levels are assigned via the app’s integration with the CEFR-J framework; we have expanded the Methods section (3.2) to describe the assignment procedure and added a calibration check that compares predicted tier difficulty against the empirical accuracy distribution. A systematic comparison against alternative difficulty models lies outside the study’s primary scope, but we have added a short discussion of CEFR-J’s appropriateness for Japanese junior-high EFL contexts. The abstract has been revised to note the validation approach.

Revision: partial
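
One plausible form for the calibration check mentioned in the second response is a rank correlation between tier order and empirical error rate. The sketch below computes Spearman's rho with standard-library code only; the tier ranks and error rates are hypothetical, and the authors' actual procedure is not specified here.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: tier order (1 = easiest) and observed error rate per tier.
TIER_ORDER = [1, 2, 3, 4, 5]
ERROR_RATE = [0.12, 0.18, 0.17, 0.29, 0.35]

rho = spearman(TIER_ORDER, ERROR_RATE)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure fits here because the tiers are ordinal: it asks only whether harder tiers produce more errors, not whether the spacing between tiers is even.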

Circularity Check

0 steps flagged

Empirical observational study with no circular derivations or self-referential claims


The paper is an empirical analysis of application log data measuring accuracy and response times across question modalities (MCQ, cloze, drag-and-drop) and CEFR-J levels. No equations, model fits, predictions, or first-principles derivations are presented; the performance hierarchy and framework validation are direct observations from the collected metrics. No load-bearing self-citations, ansatzes, or renamings of known results appear in the chain of argument. The study is self-contained against external benchmarks via the logged student interactions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that log-derived accuracy and time metrics validly index cognitive load and that the CEFR-J framework provides an independent difficulty scale; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: Student log data from the grammar app accurately captures cognitive load differences due to question type. The performance hierarchy interpretation depends on this measurement validity.
Domain assumption: CEFR-J grammar tiers are an appropriate external benchmark for empirical task difficulty. The validation result treats CEFR-J levels as given and tests alignment against them.


