arxiv: 2601.18685 · v3 · pith:NLOFGQFMnew · submitted 2026-01-26 · 🧮 math.HO · cs.LG

LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics

Pith reviewed 2026-05-25 07:24 UTC · model grok-4.3

classification 🧮 math.HO cs.LG
keywords generative AI · mathematics education · meta-analysis · learning outcomes · Bayesian modeling · educational technology · living systematic review

The pith

Generative AI shows a moderate positive effect on mathematics learning, larger when it complements rather than replaces teachers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a living meta-analysis that keeps adding new studies on generative AI tools for math education and re-runs the numbers at intervals. It pools results from 24 studies with a Bayesian multilevel model and reports an overall positive effect size. The analysis also finds that effects are stronger in settings where AI supports ongoing classroom work instead of standing in for the teacher. A reader would care because the approach produces an evidence summary that does not go stale in a fast-moving area.

Core claim

The authors report from the third update of their living meta-analysis that generative AI-based interventions produce a positive effect on mathematics learning outcomes with g = 0.40 and a credible interval of [0.14, 0.67]. They find no indication of publication bias across the included studies and moderate evidence that the benefits are larger when the AI is used to complement regular instruction rather than to replace teachers.
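
To make the reported interval easier to read, here is a rough sketch of what g = 0.40 with [0.14, 0.67] implies if one assumes the interval is a central 95% interval and the posterior is approximately normal. Both are assumptions for illustration: the actual posterior comes from the authors' model and is slightly asymmetric around 0.40, so the numbers are only indicative.

```python
from statistics import NormalDist

# Values reported in the paper's abstract.
g = 0.40
lower, upper = 0.14, 0.67

# Assumption: central 95% interval, roughly normal posterior.
z95 = NormalDist().inv_cdf(0.975)
posterior_sd = (upper - lower) / (2 * z95)

# Normal approximation to the posterior of the pooled effect.
approx = NormalDist(mu=g, sigma=posterior_sd)
prob_positive = 1 - approx.cdf(0.0)     # probability the effect is above zero
prob_above_small = 1 - approx.cdf(0.2)  # probability the effect exceeds g = 0.2

print(f"approximate posterior SD: {posterior_sd:.3f}")
print(f"P(g > 0)   ~ {prob_positive:.3f}")
print(f"P(g > 0.2) ~ {prob_above_small:.3f}")
```

Under this approximation the sign of the effect is well supported, while its size is much less certain, which matches the abstract's remark about a still limited evidence base.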

What carries the argument

A Bayesian multilevel meta-regression applied to nested, accumulating data from studies that meet PRISMA-LSR criteria, with periodic preprint updates.
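
The material summarized here does not give the model equation (the referee report below asks for it), so the following is only a generic three-level random-effects meta-regression of the kind often used when several effect sizes are nested within studies. All symbols are assumptions introduced for illustration: g_{ij} is effect size i in study j, x_j is a binary moderator (1 if the AI complements regular instruction, 0 if it replaces the teacher), s_{ij} is the known sampling standard error, and \sigma_0 and \gamma are unspecified prior scales. The prior families follow those named in the simulated rebuttal (weakly informative normal and half-Cauchy); they are not confirmed by the paper itself.

```latex
\begin{align}
g_{ij} &= \mu + \beta\, x_{j} + u_{j} + v_{ij} + \varepsilon_{ij}, \\
u_{j} &\sim \mathcal{N}\!\left(0, \tau_{\text{between}}^{2}\right), \quad
v_{ij} \sim \mathcal{N}\!\left(0, \tau_{\text{within}}^{2}\right), \quad
\varepsilon_{ij} \sim \mathcal{N}\!\left(0, s_{ij}^{2}\right), \\
\mu,\ \beta &\sim \mathcal{N}\!\left(0, \sigma_{0}^{2}\right), \quad
\tau_{\text{between}},\ \tau_{\text{within}} \sim \text{Half-Cauchy}(0, \gamma).
\end{align}
```

In a model of this shape, \mu would be the pooled effect when x_j = 0 and \beta the difference attributable to complementary use, so the stability of the moderator finding depends on how many of the 24 studies fall into each category.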

If this is right

  • Generative AI can be expected to improve math learning outcomes on average when added to existing instruction.
  • Replacement of teachers by AI alone is likely to produce smaller gains than complementary use.
  • The lack of detected publication bias supports treating the aggregated effect size as reliable for the current evidence base.
  • Continued updates to the meta-analysis will allow the effect estimate and moderator conclusions to be refined as more data arrive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The wide credible interval signals that future studies should target specific age groups or math topics to tighten the estimate.
  • The living-review format could be copied for other subjects where AI tools are spreading quickly.
  • School systems might test hybrid models that keep teachers in charge while adding AI support, using the moderator result as a starting hypothesis.

Load-bearing premise

The 24 studies form a representative and sufficiently unbiased sample of generative AI interventions in mathematics education.

What would settle it

A new wave of large, well-designed studies that shift the credible interval to include zero or negative values would undermine the claim of a positive overall effect.

Figures

Figures reproduced from arXiv: 2601.18685 by Anselm Strohmaier, Frank Reinhold, Oliver Straser, Samira Bödefeld.

Figure 1: PRISMA-LSR flow diagram documenting the study selection process for the living meta-analysis. Stages: Identification, Screening, Included. New records identified from: (1) SCOPUS (n = 91), (2) SCOPUS Preprints (n = 72), (3) citation searching (n = 12). Records screened (n = 175); records excluded (n = 158); records sought for retrieval (n = 17); reports assessed for eligibility (n = 16); new studies included (n = 6). Reports…
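
As a quick consistency check on the counts in the Figure 1 text, the following sketch confirms that the three sources add up to the number of records screened and that the exclusions leave the number sought for retrieval. The variable names are illustrative; only the counts come from the figure.

```python
# Counts as they appear in the Figure 1 text.
records_identified = {"SCOPUS": 91, "SCOPUS Preprints": 72, "Citation searching": 12}
records_screened = 175
records_excluded = 158
sought_for_retrieval = 17

total_identified = sum(records_identified.values())
remaining_after_screening = records_screened - records_excluded

print(total_identified == records_screened)               # True
print(remaining_after_screening == sought_for_retrieval)  # True
```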
Original abstract

The capabilities of generative AI in mathematics education are rapidly evolving, posing significant challenges for research to keep pace. Research syntheses remain scarce and risk being outdated by the time of publication. To address this issue, we present a Living Meta-Analysis (LIMA) on the effects of generative AI-based interventions for learning mathematics. Following PRISMA-LSR guidelines, we continuously update the literature base, apply a Bayesian multilevel meta-regression model to account for nested and cumulative data, and publish updated versions on a preprint server at regular intervals. This paper reports results from the third version, including 24 studies, 3 of which were newly included since the second version. The analyses indicate a positive effect (g = 0.40) with a wide credible interval [0.14, 0.67], reflecting the still limited evidence base. Results indicate no publication bias. Moderator analyses indicate moderate evidence that generative AI is more beneficial when it complements regular instruction rather than replacing teachers.
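
For readers unfamiliar with the effect-size metric, here is a minimal sketch of a standardized mean difference with small-sample correction, assuming that g in the abstract denotes Hedges' g. The function name and the input numbers are hypothetical, and the sketch compares post-test scores of two independent groups, which may be simpler than how the included studies' effect sizes were actually derived.

```python
import math

def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Bias-corrected standardized mean difference for two independent groups."""
    df = n_t + n_c - 2
    # Pooled standard deviation across treatment and control groups.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / pooled_sd
    # Common approximation of the small-sample correction factor.
    correction = 1 - 3 / (4 * df - 1)
    return correction * d

# Hypothetical scores: AI-supported group vs. control group, 30 learners each.
g = hedges_g(mean_t=72.0, sd_t=10.0, n_t=30, mean_c=68.0, sd_c=10.0, n_c=30)
print(round(g, 3))
```

With these made-up inputs, a 4-point advantage on a scale with a standard deviation of 10 gives a g just under 0.40, which conveys roughly how large the pooled effect is.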

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents the third version of LLAMA LIMA, a living meta-analysis following PRISMA-LSR guidelines that continuously updates a Bayesian multilevel meta-regression synthesizing evidence on generative AI interventions for mathematics learning. With 24 studies (3 newly added), it reports an overall positive effect (g = 0.40, 95% CrI [0.14, 0.67]), no publication bias, and moderate evidence from moderator analyses that generative AI is more beneficial when complementing rather than replacing regular instruction.

Significance. If the results hold after addressing transparency issues, this provides a timely, dynamically updated synthesis of an emerging area, with the living format and Bayesian approach offering strengths in handling cumulative data and uncertainty. The explicit acknowledgment of the limited evidence base and wide credible interval is a credit to the work's caution.

major comments (2)
  1. [Methods] Methods section: The description of the Bayesian multilevel meta-regression provides no model equation, prior specifications, variance component details, or coding of the complementarity moderator, which are load-bearing for assessing the stability of the moderator posterior given only 24 studies and the wide overall CrI.
  2. [Results] Results section (moderator analyses): The claim of 'moderate evidence' for complementarity over replacement depends on the 24 studies forming a representative sample without confounding (e.g., by study quality or outcome type); no sensitivity checks or inclusion criteria details are reported to support this, undermining the moderator finding.
minor comments (1)
  1. [Abstract] The abstract and title introduce 'LLAMA LIMA' without a clear expansion of the acronym on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas where greater transparency will strengthen the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Methods] Methods section: The description of the Bayesian multilevel meta-regression provides no model equation, prior specifications, variance component details, or coding of the complementarity moderator, which are load-bearing for assessing the stability of the moderator posterior given only 24 studies and the wide overall CrI.

    Authors: We agree that the Methods section requires additional detail to allow readers to evaluate the model. In the revised manuscript we will add the model equation, specify the priors (weakly informative normal and half-Cauchy distributions), report the variance components, and describe the binary coding of the complementarity moderator. These changes directly address the concern about assessing stability with only 24 studies. revision: yes

  2. Referee: [Results] Results section (moderator analyses): The claim of 'moderate evidence' for complementarity over replacement depends on the 24 studies forming a representative sample without confounding (e.g., by study quality or outcome type); no sensitivity checks or inclusion criteria details are reported to support this, undermining the moderator finding.

    Authors: We acknowledge that the moderator result is exploratory and that the absence of reported sensitivity checks limits confidence in the finding. We will add sensitivity analyses (e.g., by study quality and outcome type) and expand the Methods section with explicit inclusion criteria. The Results text will be revised to present the moderator finding with appropriate qualification regarding sample size and potential confounding. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical synthesis of external studies

full rationale

The paper reports a standard Bayesian multilevel meta-regression fitted to 24 external primary studies (3 newly added). The headline effect g = 0.40 [0.14, 0.67] and moderator findings are direct model outputs from those independent data points; no equation, parameter, or claim reduces by construction to a fitted quantity defined from the same inputs, no self-citation chain bears the central result, and no ansatz or uniqueness theorem is smuggled in. Representativeness of the sample is a validity concern, not a circularity issue. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard meta-analytic assumptions about study representativeness and model adequacy rather than on new free parameters or invented entities.

axioms (2)
  • domain assumption The 24 included studies constitute a representative sample of generative AI interventions for mathematics learning.
    Invoked when interpreting the pooled effect and moderator results as generalizable.
  • standard math The Bayesian multilevel meta-regression model correctly accounts for nesting and cumulative data structure.
    Stated as the method used to produce the g=0.40 estimate and credible interval.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses

    cs.CY 2026-04 unverdicted novelty 5.0

    Generative AI may break the education-based recovery mechanism for technological displacement, as evidence shows performance gains without learning gains and current measurements miss the knowledge dimension of cognition.

  2. Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses

    cs.CY 2026-04 unverdicted novelty 4.0

    Generative AI risks eroding the developmental process of learning by performing high-level cognitive work, creating a paradox where it helps current workers but may undermine future capacity building, requiring new ou...
