Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning

Deepak Subramani; Nicy Scaria; Silvester John Joseph Kennedy

arXiv: 2604.22770 · v1 · submitted 2026-03-29 · cs.CY · cs.AI · cs.CL · cs.HC


Pith reviewed 2026-05-14 22:06 UTC · model grok-4.3

classification: cs.CY · cs.AI · cs.CL · cs.HC

keywords: adaptive learning · multi-agent debate · language learning · CEFR rubrics · mastery-based progression · spaced review · conversational proficiency · personalized recommendations


The pith

Combining multi-agent debate scoring with mastery-based progression and spaced review improves language learning outcomes over feedback alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Learning in Blocks, a framework that evaluates open-ended conversations using CEFR-aligned rubrics instead of discrete quizzes. Heterogeneous multi-agent debate lets role-specialized agents score grammar, vocabulary, and interactive communication separately, debate disagreements, and reach consensus before a judge synthesizes final scores. These scores determine both personalized recommendations for targeted review and whether a learner has reached the 70 percent mastery threshold needed to progress. Spaced review then revisits identified weaknesses to reduce skill decay. An 8-week trial with 180 A2 learners showed this combination produced stronger outcomes than feedback alone, while benchmarking confirmed high agreement with expert human judgments.

Core claim

The framework grounds progression in demonstrated conversational competence evaluated using CEFR-aligned rubrics. It employs heterogeneous multi-agent debate in two stages: a scoring stage where role-specialized agents independently evaluate Grammar, Vocabulary, and Interactive Communication, engage in debate to address conflicting judgments, and a judge synthesizes consensus scores; and a recommendation stage that identifies specific grammar skills and vocabulary topics for targeted review. Progression requires demonstrating 70% mastery, and spaced review targets identified weaknesses to counter skill decay.

What carries the argument

Heterogeneous multi-agent debate (HeteroMAD) protocol that produces CEFR-aligned consensus scores for open-ended conversations through independent agent evaluation followed by structured debate and synthesis.
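The scoring stage described above (independent evaluation, debate over conflicting judgments, judge synthesis) can be illustrated with a toy sketch. This is not the paper's implementation. The `Scorer` type, the panel-per-dimension layout, the halfway-to-median debate update, the `tolerance` and `rounds` parameters, and the median judge are all assumptions made for illustration. In a real system each scorer would be an LLM call rather than a plain function.

```python
from statistics import median
from typing import Callable, Dict, List

# Hypothetical stand-in: a scorer maps a transcript to a rubric score.
Scorer = Callable[[str], float]


def debate_round(scores: List[float], tolerance: float) -> List[float]:
    """One toy debate round (assumed rule, not from the paper).

    If the agents disagree by more than `tolerance`, each one moves
    halfway toward the group median; otherwise scores are left as they are.
    """
    if max(scores) - min(scores) <= tolerance:
        return list(scores)
    mid = median(scores)
    return [s + 0.5 * (mid - s) for s in scores]


def score_dimension(transcript: str, agents: List[Scorer],
                    rounds: int = 2, tolerance: float = 0.5) -> float:
    """Independent scoring, bounded debate, then judge synthesis."""
    scores = [agent(transcript) for agent in agents]   # independent stage
    for _ in range(rounds):
        scores = debate_round(scores, tolerance)       # debate stage
    return median(scores)                              # judge stage (assumed: median)


def score_conversation(transcript: str,
                       panels: Dict[str, List[Scorer]]) -> Dict[str, float]:
    """Score every rubric dimension with its own panel of agents."""
    return {dim: score_dimension(transcript, agents)
            for dim, agents in panels.items()}
```

A median judge is used here only because it is robust to a single outlying agent. The paper's judge synthesizes consensus scores by a method this page does not specify.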

If this is right

Progression occurs only after a learner demonstrates 70 percent mastery on rubric-scored conversations.
Spaced review is automatically scheduled for the specific grammar and vocabulary weaknesses identified by the scoring stage.
The recommendation stage produces targeted practice items with 90.91 percent acceptability in expert review.
HeteroMAD scoring achieves a 0.23 degree of variation in agreement with expert CEFR annotations.
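The progression gate and review scheduling listed above can be sketched as follows. Only the 70 percent threshold comes from the paper. The aggregation rule (unweighted mean across dimensions), the score scale passed as `max_score`, and the review gaps of 1, 3, and 7 days are hypothetical choices, since this page does not state how the threshold combines the three rubric dimensions or how reviews are spaced.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, Tuple

MASTERY_THRESHOLD = 0.70  # from the paper; everything else below is assumed


def mastery_fraction(scores: Dict[str, float], max_score: float) -> float:
    """Assumed aggregation: unweighted mean of dimension scores, scaled to [0, 1]."""
    return sum(scores.values()) / (len(scores) * max_score)


def can_progress(scores: Dict[str, float], max_score: float) -> bool:
    """Gate progression to the next concept block on the mastery threshold."""
    return mastery_fraction(scores, max_score) >= MASTERY_THRESHOLD


def schedule_reviews(weaknesses: Iterable[str], start: date,
                     gaps_days: Tuple[int, ...] = (1, 3, 7)
                     ) -> List[Tuple[date, str]]:
    """Expanding-interval review plan; the gap lengths are hypothetical."""
    plan = [(start + timedelta(days=gap), skill)
            for skill in weaknesses
            for gap in gaps_days]
    return sorted(plan)


if __name__ == "__main__":
    scores = {"grammar": 4.0, "vocabulary": 3.0, "interactive_communication": 4.0}
    print(can_progress(scores, max_score=5.0))
    for day, skill in schedule_reviews(["past simple"], date(2026, 1, 1)):
        print(day.isoformat(), skill)
```

A per-dimension gate (every dimension at or above 70 percent) would be an equally plausible reading and would be stricter than the mean used here.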

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Digital platforms could replace quiz-driven advancement with conversation-based gates at scale if the scoring protocol holds up across more languages and proficiency levels.
The same debate-and-consensus structure might transfer to other interactive skills such as collaborative problem solving or oral argumentation.
Extending the spaced-review window beyond the eight-week study period could further reduce long-term skill decay in real-world use.
If the framework generalizes, curriculum designers might shift from item banks to libraries of open-ended prompts as the primary learning resource.

Load-bearing premise

The multi-agent debate protocol produces judgments sufficiently reliable and aligned with expert CEFR standards to serve as the sole driver of progression and review decisions.

What would settle it

A larger controlled trial in which learners using the framework show no measurable improvement in conversational proficiency over a feedback-only group, or in which expert human raters assign scores that diverge substantially from the multi-agent consensus on new conversations.

Figures

Figures reproduced from arXiv: 2604.22770 by Deepak Subramani, Nicy Scaria, Silvester John Joseph Kennedy.

**Figure 1.** Learning in Blocks framework within concept block CB_t. HeteroMAD produces CEFR-aligned scores and recommendations, triggering review lessons delivered via spaced repetition. Progression to CB_{t+1} requires mastery attainment.

**Figure 2.** Distribution of CEFR-aligned scores at Week 2 (baseline) and Week 8 (post-intervention) for three cohorts (N = 60 each). All Week 8 scores are based on Concept Block 7 conversations to ensure a fair comparison across cohorts. Although Cohorts 1 and 2 had completed Block 8 by Week 8, most Cohort 3 learners were at Block 7 (with some reaching Block 8), because progression in this condition required demonstra…

Original abstract

Most digital language learning curricula rely on discrete-item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can advance despite persistent gaps in using grammar and vocabulary during interaction. Recent work on LLM-based judging suggests a path toward scoring open-ended conversations, but using interaction evidence to drive progression and review requires scoring protocols that are reliable and validated. We introduce Learning in Blocks, a framework that grounds progression in demonstrated conversational competence evaluated using CEFR-aligned rubrics. The framework employs heterogeneous multi-agent debate (HeteroMAD) in two stages: a scoring stage where role-specialized agents independently evaluate Grammar, Vocabulary, and Interactive Communication, engage in debate to address conflicting judgments, and a judge synthesizes consensus scores; and a recommendation stage that identifies specific grammar skills and vocabulary topics for targeted review. Progression requires demonstrating 70% mastery, and spaced review targets identified weaknesses to counter skill decay. We benchmark four scoring and recommendation methods on CEFR A2 conversations annotated by ESL experts. HeteroMAD achieves a superior score agreement with a 0.23 degree of variation and recommendation acceptability of 90.91%. An 8-week study with 180 CEFR A2 learners demonstrates that combining rubric-aligned scoring and recommendation with spaced review and mastery-based progression produces better learning outcomes than feedback alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper builds a multi-agent debate system to score conversations on CEFR rubrics and drive mastery-based progression in language learning, with an 8-week study of 180 learners showing gains over basic feedback.


The main takeaway is that this work ties heterogeneous multi-agent debate scoring to actual progression rules and targeted review in a language app, rather than stopping at evaluation. The framework runs two stages: agents debate grammar, vocabulary, and interactive communication scores, then a judge produces consensus; recommendations follow for weak spots, with advancement only after 70% mastery and spaced review to prevent decay. The 8-week trial with 180 CEFR A2 learners found better outcomes than feedback alone, and the benchmark hit 90.91% recommendation acceptability against expert annotations. That combination of debate mechanics, rubric alignment, and real-user data is the concrete advance over earlier LLM judging papers. The study design itself is a plus because it tests the full loop on actual learners instead of synthetic data.

The soft spot is the scoring reliability claim. The abstract calls the agreement a 0.23 degree of variation, which is too vague to judge. If that number reflects disagreement or error spread, it could create noisy mastery calls at the 70% threshold and make it hard to attribute gains cleanly to the framework. Full methods need per-dimension stats, confusion matrices, and a sensitivity check at the cutoff before the progression logic looks solid. The stress-test concern about misclassifications is reasonable until those numbers appear.

This paper is for people building or studying LLM tutors for languages who want a worked example of multi-agent scoring feeding adaptive rules. Readers focused on practical deployment will get the most from the architecture and the learner outcomes. It deserves peer review because the empirical piece is present and the gap it targets is documented, even if the reliability details require tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Learning in Blocks framework for personalized adaptive language learning. It employs heterogeneous multi-agent debate (HeteroMAD) to score open-ended conversations using CEFR-aligned rubrics on grammar, vocabulary, and interactive communication, with a judge synthesizing consensus scores. Progression is gated at a 70% mastery threshold, and the system generates targeted review recommendations. Benchmarks on expert-annotated A2 conversations report 0.23 variation in score agreement and 90.91% recommendation acceptability for HeteroMAD; an 8-week study with 180 CEFR A2 learners claims superior outcomes versus feedback alone when combining rubric scoring, spaced review, and mastery-based progression.

Significance. If the scoring protocol is shown to be sufficiently reliable, the work could meaningfully advance adaptive language systems by replacing discrete-item quizzes with conversational proficiency assessment, enabling more accurate targeting of skill gaps and countering decay through spaced review. The external grounding in expert annotations and separate learner study avoids circularity, which strengthens potential impact if the reliability concerns are resolved.

major comments (2)

[Abstract] The claim of 'superior score agreement with a 0.23 degree of variation' is load-bearing for the central claim that HeteroMAD can drive progression and recommendations without expert oversight. The metric is unspecified (Cohen's kappa, standard deviation, or other?), and at the 70% mastery threshold this level of noise risks frequent misclassifications of learner competence. Full methods must provide per-dimension inter-rater statistics, confusion matrices, and sensitivity analysis at the cutoff to support internal validity.
[Abstract] The 8-week study with 180 learners is presented as demonstrating better outcomes from the combined framework, yet the abstract provides no statistical tests, effect sizes, inter-rater reliability for the deployed scoring, or data exclusion rules. Without these, attribution of gains to rubric-aligned scoring plus spaced review rather than measurement error cannot be verified.

minor comments (2)

[Abstract] Clarify the exact definition and implementation of the 70% mastery threshold and how it aggregates across the three rubric dimensions.
[Abstract] The term 'HeteroMAD' is introduced without a clear expansion or diagram of the agent roles and debate protocol in the provided abstract; a methods figure would aid reproducibility.
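The sensitivity analysis requested in the major comments can be illustrated with a short calculation. The sketch below treats the scoring error as zero-mean Gaussian noise added to a learner's true score, which is an assumption, and computes how often a learner near the cutoff would be placed on the wrong side of it. Reading 0.23 as a standard deviation on the score scale, and placing the cutoff at 3.5 on a 0 to 5 scale (70 percent of 5), are both hypothetical choices for the example, not values confirmed by the paper.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pass_probability(true_score: float, cutoff: float, noise_sd: float) -> float:
    """P(observed >= cutoff) when observed = true_score + N(0, noise_sd**2)."""
    return 1.0 - normal_cdf((cutoff - true_score) / noise_sd)


def misclassification_probability(true_score: float, cutoff: float,
                                  noise_sd: float) -> float:
    """Chance that the noisy score lands on the wrong side of the cutoff."""
    p_pass = pass_probability(true_score, cutoff, noise_sd)
    return 1.0 - p_pass if true_score >= cutoff else p_pass


if __name__ == "__main__":
    cutoff, noise_sd = 3.5, 0.23  # hypothetical scale and noise level
    for true_score in (3.0, 3.3, 3.5, 3.7, 4.0):
        p = misclassification_probability(true_score, cutoff, noise_sd)
        print(f"true={true_score:.1f}  P(misclassified)={p:.3f}")
```

Under this model a learner whose true score sits exactly at the cutoff is passed or held back with equal probability, and the error rate falls off quickly as the true score moves more than a couple of noise standard deviations away from the cutoff. A sensitivity table of this kind is what the referee asks the authors to report with measured, per-dimension error.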

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major point below and will incorporate revisions to improve clarity and completeness.


Referee: [Abstract] The claim of 'superior score agreement with a 0.23 degree of variation' is load-bearing for the central claim that HeteroMAD can drive progression and recommendations without expert oversight. The metric is unspecified (Cohen's kappa, standard deviation, or other?), and at the 70% mastery threshold this level of noise risks frequent misclassifications of learner competence. Full methods must provide per-dimension inter-rater statistics, confusion matrices, and sensitivity analysis at the cutoff to support internal validity.

Authors: We agree the abstract phrasing is ambiguous and will revise it to specify that the 0.23 value is the standard deviation of absolute score differences between HeteroMAD and expert annotations across dimensions. The full Methods section already contains per-dimension agreement metrics and will be expanded with confusion matrices for each rubric (Grammar, Vocabulary, Interactive Communication) plus a sensitivity analysis of classification stability around the 70% threshold. These additions will be highlighted in the revision.

Revision: yes

Referee: [Abstract] The 8-week study with 180 learners is presented as demonstrating better outcomes from the combined framework, yet the abstract provides no statistical tests, effect sizes, inter-rater reliability for the deployed scoring, or data exclusion rules. Without these, attribution of gains to rubric-aligned scoring plus spaced review rather than measurement error cannot be verified.

Authors: The Results section reports the relevant statistical tests (independent-samples t-tests on post-test scores and retention measures) together with effect sizes and the inter-rater agreement of the deployed HeteroMAD system against expert annotations on a validation subset. Data exclusion criteria (participants completing fewer than 70% of sessions) are stated in Methods. We will add a concise summary of the key statistical results and effect sizes to the abstract and ensure explicit cross-references to the full reliability and exclusion details.

Revision: yes
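The quantities named in the simulated rebuttal can be made concrete with a small sketch. It computes the sample standard deviation of absolute score differences between a system and expert raters, plus Cohen's d and a t statistic for two independent groups. The function names are hypothetical, the pooled-SD form of Cohen's d and the Welch form of the t statistic are choices made here (the rebuttal says only "independent-samples t-tests" and "effect sizes"), and the numbers in the usage lines are made-up toy data, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev, variance
from typing import Sequence


def sd_abs_diff(system: Sequence[float], expert: Sequence[float]) -> float:
    """Sample SD of |system - expert| over paired conversations."""
    diffs = [abs(s - e) for s, e in zip(system, expert)]
    return stdev(diffs)


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d for two independent groups, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic (unequal variances); the variant is assumed here."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


if __name__ == "__main__":
    # Toy data only, chosen so the arithmetic is easy to follow.
    print(sd_abs_diff([3.0, 4.0, 5.0], [3.0, 3.0, 3.0]))
    print(cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
    print(welch_t([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
```

A spread statistic such as `sd_abs_diff` says nothing about systematic bias, so reporting it alongside the mean absolute difference and a chance-corrected agreement measure would give a fuller picture of reliability.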

Circularity Check

0 steps flagged

No circularity: external validation and empirical study


The paper describes a framework (HeteroMAD scoring plus spaced review) whose central claims rest on benchmarking against independent ESL expert annotations of CEFR A2 conversations and on an 8-week study with 180 separate learners. No equations, fitted parameters, or self-citations are presented that reduce any reported outcome or progression rule to the same data by construction. The 0.23 agreement figure and 90.91% recommendation acceptability are stated as measured quantities against external labels; the learning-outcome comparison is likewise an empirical result, not a definitional identity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Framework rests on the assumption that CEFR rubrics can be reliably applied to AI-generated conversation transcripts and that a 70% mastery threshold is educationally meaningful; no new physical entities are postulated.

free parameters (1)

70% mastery threshold
Arbitrary cutoff chosen to decide progression; directly controls when learners advance or receive review.

axioms (1)

Domain assumption: CEFR rubrics provide valid and consistent criteria for scoring conversational grammar, vocabulary, and interactive communication.
Invoked throughout the scoring stage and benchmark evaluation.

invented entities (1)

HeteroMAD (heterogeneous multi-agent debate): no independent evidence
Purpose: role-specialized agents that independently score and then debate to produce consensus CEFR scores.
New procedural construct introduced for the scoring stage.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "HeteroMAD achieves a superior score agreement with a 0.23 degree of variation and recommendation acceptability of 90.91%. An 8-week study with 180 CEFR A2 learners"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Progression requires demonstrating 70% mastery"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

