LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics
Pith reviewed 2026-05-25 07:24 UTC · model grok-4.3
The pith
Generative AI shows a moderate positive effect on mathematics learning, larger when it complements rather than replaces teachers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report from the third version of their living meta-analysis that generative AI-based interventions produce a positive effect on mathematics learning outcomes, with g = 0.40 and a credible interval of [0.14, 0.67]. They find no indication of publication bias across the included studies and moderate evidence that the benefits are larger when the AI is used to complement regular instruction rather than to replace teachers.
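For orientation, here is a minimal sketch of how a Hedges' g value of this kind can be computed for a single two-group comparison. It is an illustration under stated assumptions, not the authors' pipeline: the function name `hedges_g` and all input numbers are hypothetical, and it uses plain group means, whereas an excerpt of the paper indicates that effects were derived from differences in gains between intervention and control groups.

```python
import math

def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, variance) for a two-group standardized mean difference."""
    df = n_t + n_c - 2
    # Pooled standard deviation across treatment and control groups.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled
    # Small-sample correction factor (the usual approximation).
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Approximate sampling variance of d, rescaled for g.
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return g, j**2 * var_d

# Hypothetical numbers, not taken from any included study.
g, var_g = hedges_g(mean_t=75.0, sd_t=10.0, n_t=30,
                    mean_c=71.0, sd_c=10.0, n_c=30)
print(f"g = {g:.3f}, variance = {var_g:.4f}")
```

The correction factor matters mostly for small samples; with 30 learners per group it shrinks the raw standardized difference by a little over one percent.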
What carries the argument
A Bayesian multilevel meta-regression applied to nested, accumulating data from studies that meet PRISMA-LSR criteria, with periodic preprint updates.
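To make the pooling step concrete, the following sketch shows random-effects pooling of study-level effect sizes. It deliberately swaps in the frequentist DerSimonian-Laird estimator for independent effects, which is much simpler than the Bayesian multilevel model described above and ignores the nesting of several effects within one study. The function name `random_effects_pool` and the input values are hypothetical.

```python
import math

def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling of independent effect sizes."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted dispersion around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Hypothetical effect sizes and sampling variances for illustration only.
effects = [0.10, 0.35, 0.60, 0.45]
variances = [0.04, 0.05, 0.06, 0.05]
pooled, se, tau2 = random_effects_pool(effects, variances)
print(f"pooled = {pooled:.3f}, SE = {se:.3f}, tau^2 = {tau2:.3f}")
```

A multilevel model extends this idea by adding a second variance component for effects that share a study, and a Bayesian fit replaces the point estimate and standard error with a posterior distribution.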
If this is right
- Generative AI can be expected to improve math learning outcomes on average when added to existing instruction.
- Replacement of teachers by AI alone is likely to produce smaller gains than complementary use.
- The lack of detected publication bias supports treating the aggregated effect size as reliable for the current evidence base.
- Continued updates to the meta-analysis will allow the effect estimate and moderator conclusions to be refined as more data arrive.
Where Pith is reading between the lines
- The wide credible interval signals that future studies should target specific age groups or math topics to tighten the estimate.
- The living-review format could be copied for other subjects where AI tools are spreading quickly.
- School systems might test hybrid models that keep teachers in charge while adding AI support, using the moderator result as a starting hypothesis.
Load-bearing premise
The 24 studies form a representative and sufficiently unbiased sample of generative AI interventions in mathematics education.
What would settle it
A new wave of large, well-designed studies that shift the credible interval to include zero or negative values would undermine the claim of a positive overall effect.
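As a rough gauge of how far the current interval sits from zero, the sketch below converts the reported estimate and interval into an approximate posterior standard deviation and a probability that the effect is positive. It assumes a 95% interval (as stated in the referee summary) and an approximately normal posterior; neither the posterior shape nor these derived quantities are reported in the abstract.

```python
from statistics import NormalDist

g, lower, upper = 0.40, 0.14, 0.67  # values reported in the abstract

# Assumption: 95% interval and an approximately normal posterior.
z = NormalDist().inv_cdf(0.975)
posterior_sd = (upper - lower) / (2 * z)
prob_positive = 1 - NormalDist(mu=g, sigma=posterior_sd).cdf(0.0)

print(f"approx. posterior SD: {posterior_sd:.3f}")
print(f"approx. P(effect > 0): {prob_positive:.4f}")
```

Under these assumptions the lower bound is about one posterior standard deviation above zero, so new evidence would have to pull the pooled estimate down substantially, or add considerable heterogeneity, before the interval covered zero.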
Original abstract
The capabilities of generative AI in mathematics education are rapidly evolving, posing significant challenges for research to keep pace. Research syntheses remain scarce and risk being outdated by the time of publication. To address this issue, we present a Living Meta-Analysis (LIMA) on the effects of generative AI-based interventions for learning mathematics. Following PRISMA-LSR guidelines, we continuously update the literature base, apply a Bayesian multilevel meta-regression model to account for nested and cumulative data, and publish updated versions on a preprint server at regular intervals. This paper reports results from the third version, including 24 studies, 3 of which were newly included since the second version. The analyses indicate a positive effect (g = 0.40) with a wide credible interval [0.14, 0.67], reflecting the still limited evidence base. Results indicate no publication bias. Moderator analyses indicate moderate evidence that generative AI is more beneficial when it complements regular instruction rather than replacing teachers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the third version of LLAMA LIMA, a living meta-analysis following PRISMA-LSR guidelines that continuously updates a Bayesian multilevel meta-regression synthesizing evidence on generative AI interventions for mathematics learning. With 24 studies (3 newly added), it reports an overall positive effect (g = 0.40, 95% CrI [0.14, 0.67]), no publication bias, and moderate evidence from moderator analyses that generative AI is more beneficial when complementing rather than replacing regular instruction.
Significance. If the results hold after addressing transparency issues, this provides a timely, dynamically updated synthesis of an emerging area, with the living format and Bayesian approach offering strengths in handling cumulative data and uncertainty. The explicit acknowledgment of the limited evidence base and wide credible interval is a credit to the work's caution.
major comments (2)
- [Methods] Methods section: The description of the Bayesian multilevel meta-regression provides no model equation, prior specifications, variance component details, or coding of the complementarity moderator, which are load-bearing for assessing the stability of the moderator posterior given only 24 studies and the wide overall CrI.
- [Results] Results section (moderator analyses): The claim of 'moderate evidence' for complementarity over replacement depends on the 24 studies forming a representative sample without confounding (e.g., by study quality or outcome type); no sensitivity checks or inclusion criteria details are reported to support this, undermining the moderator finding.
minor comments (1)
- [Abstract] The abstract and title introduce 'LLAMA LIMA' without a clear expansion of the acronym on first use.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which identifies key areas where greater transparency will strengthen the manuscript. We address each major comment below.
Point-by-point responses
- Referee: [Methods] Methods section: The description of the Bayesian multilevel meta-regression provides no model equation, prior specifications, variance component details, or coding of the complementarity moderator, which are load-bearing for assessing the stability of the moderator posterior given only 24 studies and the wide overall CrI.
  Authors: We agree that the Methods section requires additional detail to allow readers to evaluate the model. In the revised manuscript we will add the model equation, specify the priors (weakly informative normal and half-Cauchy distributions), report the variance components, and describe the binary coding of the complementarity moderator. These changes directly address the concern about assessing stability with only 24 studies.
  Revision: yes
- Referee: [Results] Results section (moderator analyses): The claim of 'moderate evidence' for complementarity over replacement depends on the 24 studies forming a representative sample without confounding (e.g., by study quality or outcome type); no sensitivity checks or inclusion criteria details are reported to support this, undermining the moderator finding.
  Authors: We acknowledge that the moderator result is exploratory and that the absence of reported sensitivity checks limits confidence in the finding. We will add sensitivity analyses (e.g., by study quality and outcome type) and expand the Methods section with explicit inclusion criteria. The Results text will be revised to present the moderator finding with appropriate qualification regarding sample size and potential confounding.
  Revision: yes
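To illustrate the kind of specification the referee asks for, here is one generic three-level meta-regression written out in LaTeX. It is a sketch of a common form, not the authors' model: the symbols, the study-level coding of the moderator, and the prior scales are all assumptions, with only the prior families (normal, half-Cauchy) and the binary moderator taken from the rebuttal.

```latex
\begin{aligned}
g_{ij} &= \mu + \beta\, x_{j} + u_{j} + v_{ij} + e_{ij}, \\
u_{j} &\sim \mathcal{N}(0, \tau_{\text{study}}^{2}), \quad
v_{ij} \sim \mathcal{N}(0, \tau_{\text{effect}}^{2}), \quad
e_{ij} \sim \mathcal{N}(0, s_{ij}^{2}), \\
\mu, \beta &\sim \mathcal{N}(0, \sigma_{0}^{2}), \quad
\tau_{\text{study}}, \tau_{\text{effect}} \sim \text{Half-Cauchy}(0, \gamma).
\end{aligned}
```

Here g_{ij} is effect i in study j, x_j is a hypothetical indicator equal to 1 when the AI complements instruction and 0 when it replaces the teacher, s_{ij}^2 is the known sampling variance, and sigma_0 and gamma are unspecified prior scales. With 24 studies, the posterior for beta can be sensitive to gamma, which is why reporting the prior scales matters.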
Circularity Check
No significant circularity; empirical synthesis of external studies
The paper reports a standard Bayesian multilevel meta-regression fitted to 24 external primary studies (3 newly added). The headline effect g = 0.40 [0.14, 0.67] and moderator findings are direct model outputs from those independent data points; no equation, parameter, or claim reduces by construction to a fitted quantity defined from the same inputs, no self-citation chain bears the central result, and no ansatz or uniqueness theorem is smuggled in. Representativeness of the sample is a validity concern, not a circularity issue. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The 24 included studies constitute a representative sample of generative AI interventions for mathematics learning.
- standard math The Bayesian multilevel meta-regression model correctly accounts for nesting and cumulative data structure.
Forward citations
Cited by 2 Pith papers
- Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses
  Generative AI may break the education-based recovery mechanism for technological displacement, as evidence shows performance gains without learning gains and current measurements miss the knowledge dimension of cognition.
- Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses
  Generative AI risks eroding the developmental process of learning by performing high-level cognitive work, creating a paradox where it helps current workers but may undermine future capacity building, requiring new ou...
Reference graph
Works this paper leans on
[1]
LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics Version 2, 03/26 Anselm Strohmaier, Samira Bödefeld, Oliver Straser, Frank Reinhold University of Education Freiburg, Institute of Mathematics Education Abstract. The capabilities of generative AI in mathematics education are rapidly evolving, posing significant chall...
arXiv, doi:10.48550/arxiv.2601.18685, 2026
[2]
Changes to the previous version: Oliver Straser is now a co-author. This version includes 6 additional studies with 11 new effects. Analyses, results, figures and tables have been updated. Publication bias analyses with RoBMA have been added. Additional references have been added in the introduction. Changed the wording roles to purposes in the theoretica...
[3]
More recent syntheses illustrate how quickly evidence assessments become outdated as model capabilities evolve: For example, the scoping review by Pepin et al. (2025), published in February 2025 and based on studies available up to May 2024, discusses limitations in ChatGPT’s mathematical performance that have mostly been mitigated by subsequent model ver...
[4]
and the meta-analysis by Wang and Fan (2025). We propose a set of five categories that describe potential purposes through which generative AI may support students’ mathematical learning. Generative AI as a mathematics expert. Generative AI systems can generate correct answers and complete solutions for a wide range of school-relevant mathematical tasks (e...
[5]
and are likely to account for variability in observed effects across studies. Another characteristic that might moderate the effectiveness of generative AI interventions is the underlying theory of learning guiding their design. Across studies, generative AI may be embedded within different instructional paradigms—such as direct instruction, problem-based or...
[6]
and an update of the publication at the alternating month (i.e., the next version is scheduled for May 2026). Depending on the frequency of new publications and their influence on the overall effect and feasibility of moderator analyses, these intervals might be altered in the future. Reports that had been excluded in previous versions might be included in ...
[7]
The study is planned to be retired from the living mode and published as a permanent version eventually, but as of now, there is no prespecified timeline. 3.2 Literature Search The literature search for the current version of the living meta-analysis was conducted on February 2nd, 2026, using SCOPUS for documents and preprints. The search targeted experime...
[8]
During screening, studies were included if they a) reported original data from an experimental or quasi-experimental intervention study, b) used generative AI in the intervention and no generative AI in the control group, c) involved human learners, d) reported mathematics performance as an outcome measure, and e) were written in English. We included stud...
[9]
[Flow diagram residue: studies included in meta-analysis and new studies included, by version; counts not recoverable.] Participant characteristics. Participant characteristics included learners’ educational level based on the International Standard Classification of Education (ISCED; Unesc...
[10]
4.4 Publication bias Publication bias was assessed using the multilevel robust Bayesian model-averaged meta-analytic framework implemented in RoBMA (Bartoš & Maier, 2020; Bartoš, Maier, et al., 2025). This approach averages across models with and without publication-bias adjustments and quantifies evidence via Bayes factors. The inclusion Bayes factor for ...
[11]
Cumulative Bayesian meta-analysis over time. Study-level effect estimates (Hedges’ g) are shown as points at their publication dates, with point size proportional to the effective sampling precision of each study, accounting for within-study dependence. The smoothed line and shaded region indicate the posterior median and 95% credible interval of the pooled...
[12]
5 Discussion LLAMA LIMA provides an ongoing synthesis of intervention studies that use generative AI to support mathematics learning. Our analysis shows a small positive average effect (g = 0.42) across 21 studies and 38 effect sizes. Together with the wide credible intervals and substantial heterogeneity this suggests that generative AI-based interventions...
[13]
can be used, which indicates that the effect is, right now, relatively small. Regarding results not specific to mathematics, Wang and Fan (2025) reported a substantially higher mean effect of g = 0.87 of using ChatGPT on learning performance, but might be highly influenced by publication bias (Bartoš, Martinková, et al., 2025). Hattie’s hinge point (d = 0.40; Hattie,
[14]
might also be considered as a benchmark. However, it must be considered that this effect size typically stems directly from pre-post comparisons. In contrast, in our meta-analysis we determine effect sizes as differences in gain of the intervention group compared to a control group. The substantial heterogeneity of effects across studies indicates that the eff...
[15]
https://doi.org/10.18637/jss.v080.i01 Canonigo, A. M. (2024). Levering AI to enhance students' conceptual understanding and confidence in mathematics. Journal of computer assisted learning, 40(6), 3215-3229. Cheng, L., Croteau, E., Baral, S., Heffernan, C., & Heffernan, N. (2024). Facilitating student learning with a chatbot in an online math learning platfo...
[16]
Liu, Y., Zha, S., Zhang, Y., Wang, Y., Zhang, Y., Xin, Q., Nie, L. Y., Zhang, C., & Xu, Y. (2025). BrickSmart: Leveraging Generative AI to Support Children's Spatial Language Learning in Family Block Play. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Ma, N., & Zhong, Z. (2025). A Meta-Analysis of the Impact of Generative A...
[17]
Ng, D. T. K., Chan, E. K. C., & Lo, C. K. (2025). Opportunities, challenges and school strategies for integrating generative AI in education. Computers and Education: Artificial Intelligence, 100373. OECD. (2006). Assessing Scientific, Reading and Mathematical Literacy: A Framework for PISA
[18]
PISA, OECD Publishing. https://doi.org/10.1787/9789264026407-en Pardos, Z. A., & Bhandari, S. (2024). ChatGPT-generated help produces learning gains equivalent to human tutor-authored help on mathematics skills. PLoS ONE, 19(5), e0304013. Pepin, B., Buchholtz, N., & Salinas-Hernández, U. (2025). A Scoping Survey of ChatGPT in Mathematics Education. Dig...
[19]
Rücker, C. R., & Becker-Genschow, S. (2025). Enhancing Enthusiasm for STEM Education with AI: Domain-Specific Chatbot as Personalized Learning Assistant. Computers and Education Open, 100315. https://doi.org/10.1016/j.caeo.2025.100315 Schneider, M., & Stern, E. (2010). The cognitive perspective on learning: Ten cornerstone findings. In O. f. E. C.-O. a. D. ...
[20]
UNESCO. Utami, I. Q., Hwang, W.-Y., & Hariyanti, U. (2024). Contextualized and personalized math word problem generation in authentic contexts using generative pre-trained transformer and its influences on geometry learning. Journal of Educational Computing Research, 62(6), 1384-1419. https://doi.org/10.1177/07356331241249225 Viechtbauer, W. (2010). Conduc...
[21]
https://doi.org/10.18637/jss.v036.i03 Wahba, F., Ajlouni, A. O., & Abumosa, M. A. (2024). The impact of ChatGPT-based learning statistics on undergraduates’ statistical reasoning and attitudes toward statistics. Eurasia Journal of Mathematics, Science and Technology Education, 20(7), em2468. Walkington, C. (2025). The implications of generative artificial ...