LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics

Anselm Strohmaier; Frank Reinhold; Oliver Straser; Samira Bödefeld

arxiv: 2601.18685 · v3 · pith:NLOFGQFMnew · submitted 2026-01-26 · 🧮 math.HO · cs.LG

Pith reviewed 2026-05-25 07:24 UTC · model grok-4.3

keywords generative AI · mathematics education · meta-analysis · learning outcomes · Bayesian modeling · educational technology · living systematic review

The pith

Generative AI shows a moderate positive effect on mathematics learning, larger when it complements rather than replaces teachers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a living meta-analysis that keeps adding new studies on generative AI tools for math education and re-runs the numbers at intervals. It pools results from 24 studies with a Bayesian multilevel model and reports an overall positive effect size. The analysis also finds that effects are stronger in settings where AI supports ongoing classroom work instead of standing in for the teacher. A reader would care because the approach produces an evidence summary that does not go stale in a fast-moving area.

Core claim

The authors report from the third update of their living meta-analysis that generative AI-based interventions produce a positive effect on mathematics learning outcomes with g = 0.40 and a credible interval of [0.14, 0.67]. They find no indication of publication bias across the included studies and moderate evidence that the benefits are larger when the AI is used to complement regular instruction rather than to replace teachers.
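As a rough reading aid, the reported interval can be converted into an approximate posterior standard deviation and a probability that the pooled effect is positive. The sketch below assumes that the interval is a central 95% interval and that the posterior is roughly normal. Neither assumption is stated in the abstract, and the function name is hypothetical.

```python
from statistics import NormalDist


def normal_summary_from_interval(lower, upper, mass=0.95):
    """Match a normal distribution to a central interval.

    Assumption (not from the paper): the posterior is roughly normal and
    [lower, upper] is a central interval holding `mass` of the probability.
    """
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)   # about 1.96 when mass is 0.95
    mean = (lower + upper) / 2.0                 # midpoint of the interval
    sd = (upper - lower) / (2.0 * z)             # half-width divided by z
    prob_positive = 1.0 - NormalDist(mean, sd).cdf(0.0)
    return mean, sd, prob_positive


# Interval endpoints as reported for the third version of the meta-analysis.
mean, sd, prob_positive = normal_summary_from_interval(0.14, 0.67)
print(f"midpoint={mean:.3f}  sd={sd:.3f}  P(effect > 0)={prob_positive:.4f}")
```

The midpoint of the interval sits close to the reported g = 0.40, which is what one would expect if the posterior is nearly symmetric. The authors' actual posterior may be skewed, in which case this approximation is only indicative.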

What carries the argument

A Bayesian multilevel meta-regression applied to nested, accumulating data from studies that meet PRISMA-LSR criteria, with periodic preprint updates.
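One common way to write a model of this kind is the three-level form below. It is a generic sketch and not the authors' specification: the symbols, the single binary moderator $x_j$, and the nesting of effect sizes within studies are assumptions made here for illustration.

```latex
\begin{align}
  g_{ij} &= \mu + \beta x_{j} + u_{j} + w_{ij} + e_{ij}, \\
  u_{j} &\sim \mathcal{N}(0, \tau^{2}), \qquad
  w_{ij} \sim \mathcal{N}(0, \omega^{2}), \qquad
  e_{ij} \sim \mathcal{N}(0, v_{ij}).
\end{align}
```

Here $g_{ij}$ is the $i$-th effect size (Hedges' g) in study $j$, $\mu$ is the pooled effect, $\beta$ is the moderator coefficient (for example, complement versus replace), $\tau^{2}$ and $\omega^{2}$ are the between-study and within-study variance components, and $v_{ij}$ is the known sampling variance of each effect size.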

If this is right

Generative AI can be expected to improve math learning outcomes on average when added to existing instruction.
Replacement of teachers by AI alone is likely to produce smaller gains than complementary use.
The lack of detected publication bias supports treating the aggregated effect size as reliable for the current evidence base.
Continued updates to the meta-analysis will allow the effect estimate and moderator conclusions to be refined as more data arrive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The wide credible interval signals that future studies should target specific age groups or math topics to tighten the estimate.
The living-review format could be copied for other subjects where AI tools are spreading quickly.
School systems might test hybrid models that keep teachers in charge while adding AI support, using the moderator result as a starting hypothesis.

Load-bearing premise

The 24 studies form a representative and sufficiently unbiased sample of generative AI interventions in mathematics education.

What would settle it

A new wave of large, well-designed studies that shift the credible interval to include zero or negative values would undermine the claim of a positive overall effect.

Figures

Figures reproduced from arXiv: 2601.18685 by Anselm Strohmaier, Frank Reinhold, Oliver Straser, Samira Bödefeld.

**Figure 1.** PRISMA-LSR flow diagram documenting the study selection process for the living meta-analysis. New records identified from: (1) SCOPUS (n = 91), (2) SCOPUS Preprints (n = 72), (3) citation searching (n = 12). Records screened (n = 175); records excluded (n = 158); records sought for retrieval (n = 17); reports assessed for eligibility (n = 16); new studies included (n = 6). Reports…

read the original abstract

The capabilities of generative AI in mathematics education are rapidly evolving, posing significant challenges for research to keep pace. Research syntheses remain scarce and risk being outdated by the time of publication. To address this issue, we present a Living Meta-Analysis (LIMA) on the effects of generative AI-based interventions for learning mathematics. Following PRISMA-LSR guidelines, we continuously update the literature base, apply a Bayesian multilevel meta-regression model to account for nested and cumulative data, and publish updated versions on a preprint server at regular intervals. This paper reports results from the third version, including 24 studies, 3 of which were newly included since the second version. The analyses indicate a positive effect (g = 0.40) with a wide credible interval [0.14, 0.67], reflecting the still limited evidence base. Results indicate no publication bias. Moderator analyses indicate moderate evidence that generative AI is more beneficial when it complements regular instruction rather than replacing teachers.
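For readers unfamiliar with the metric, the following is a minimal sketch of how a single Hedges' g and its sampling variance are commonly computed from two-group summary statistics. It uses the standard textbook formulas on posttest means. The paper describes its effect sizes as differences in gain between intervention and control groups, so this is an illustration of the metric rather than the authors' procedure, and the input numbers are made up.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with Hedges' small-sample correction."""
    df = n_t + n_c - 2
    # Pooled standard deviation of the two groups
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_t - mean_c) / pooled_sd
    # Common approximation to the small-sample correction factor J
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    g = j * d
    # Large-sample approximation to the sampling variance of g
    var_g = j ** 2 * ((n_t + n_c) / (n_t * n_c) + d ** 2 / (2.0 * (n_t + n_c)))
    return g, var_g


# Hypothetical study: treatment vs. control test scores (not from the paper).
g, var_g = hedges_g(mean_t=75.0, mean_c=70.0, sd_t=12.0, sd_c=13.0, n_t=30, n_c=30)
print(f"g={g:.3f}  SE={math.sqrt(var_g):.3f}")
```

With two groups of 30 learners, the standard error of a single study's g is large relative to an effect of about 0.4, which is one reason pooling across studies is needed.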

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This living meta-analysis gives a g=0.40 signal for generative AI in math learning with a moderator hint favoring complementarity, but the small study count and missing methods details make the claims preliminary at best.

The main takeaway is that the paper runs a living meta-analysis on generative AI interventions in math education and reports a positive overall effect (g = 0.40, credible interval 0.14 to 0.67) plus moderate evidence that the tools help more when they supplement rather than replace regular teaching. The living format with PRISMA-LSR updates and a Bayesian multilevel model is a sensible way to handle a field that moves this fast, and the third version adds three new studies to reach 24 total. That setup is the clearest contribution here. The paper also checks for publication bias and finds none, which is worth noting even if the power of the test is limited. The approach shows an honest effort to keep the synthesis current instead of letting it go stale.

The soft spots are straightforward. Twenty-four studies is still a thin base for a multilevel model that has to estimate an overall effect, a moderator coefficient, and variance components at once. The wide credible interval already flags the uncertainty, and the moderator result on complementarity versus replacement depends on how representative those 24 studies are. If the included studies skew toward certain tools, grade levels, or regions, or if complementarity lines up with better-designed studies, the posterior could move. The abstract gives no inclusion criteria, exact model specification, or sensitivity checks, so it is impossible to judge how stable the numbers are. The concern that the studies may not form an unbiased sample therefore bears directly on the reported results.

This work is mainly for people who follow meta-analytic methods in education technology or who need a quick current snapshot while waiting for more primary studies. It is not yet solid enough to drive classroom policy.

I would send it to peer review because the living meta-analysis idea is worth referee attention and the topic is timely, but the methods section will need substantial expansion and the moderator claim will need robustness checks before it can be taken as more than exploratory.

Referee Report

2 major / 1 minor

Summary. The paper presents the third version of LLAMA LIMA, a living meta-analysis following PRISMA-LSR guidelines that continuously updates a Bayesian multilevel meta-regression synthesizing evidence on generative AI interventions for mathematics learning. With 24 studies (3 newly added), it reports an overall positive effect (g = 0.40, 95% CrI [0.14, 0.67]), no publication bias, and moderate evidence from moderator analyses that generative AI is more beneficial when complementing rather than replacing regular instruction.

Significance. If the results hold after addressing transparency issues, this provides a timely, dynamically updated synthesis of an emerging area, with the living format and Bayesian approach offering strengths in handling cumulative data and uncertainty. The explicit acknowledgment of the limited evidence base and wide credible interval is a credit to the work's caution.

major comments (2)

[Methods] Methods section: The description of the Bayesian multilevel meta-regression provides no model equation, prior specifications, variance component details, or coding of the complementarity moderator, which are load-bearing for assessing the stability of the moderator posterior given only 24 studies and the wide overall CrI.
[Results] Results section (moderator analyses): The claim of 'moderate evidence' for complementarity over replacement depends on the 24 studies forming a representative sample without confounding (e.g., by study quality or outcome type); no sensitivity checks or inclusion criteria details are reported to support this, undermining the moderator finding.

minor comments (1)

[Abstract] The abstract and title introduce 'LLAMA LIMA' without a clear expansion of the acronym on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas where greater transparency will strengthen the manuscript. We address each major comment below.

Referee: [Methods] Methods section: The description of the Bayesian multilevel meta-regression provides no model equation, prior specifications, variance component details, or coding of the complementarity moderator, which are load-bearing for assessing the stability of the moderator posterior given only 24 studies and the wide overall CrI.

Authors: We agree that the Methods section requires additional detail to allow readers to evaluate the model. In the revised manuscript we will add the model equation, specify the priors (weakly informative normal and half-Cauchy distributions), report the variance components, and describe the binary coding of the complementarity moderator. These changes directly address the concern about assessing stability with only 24 studies. revision: yes
Referee: [Results] Results section (moderator analyses): The claim of 'moderate evidence' for complementarity over replacement depends on the 24 studies forming a representative sample without confounding (e.g., by study quality or outcome type); no sensitivity checks or inclusion criteria details are reported to support this, undermining the moderator finding.

Authors: We acknowledge that the moderator result is exploratory and that the absence of reported sensitivity checks limits confidence in the finding. We will add sensitivity analyses (e.g., by study quality and outcome type) and expand the Methods section with explicit inclusion criteria. The Results text will be revised to present the moderator finding with appropriate qualification regarding sample size and potential confounding. revision: yes
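The first response names prior families (normal and half-Cauchy) but gives no scales. To show what such a model involves, here is a minimal sketch of a two-level Bayesian random-effects meta-analysis fitted by grid approximation in plain Python. Everything specific in it is an assumption: the prior scales, the grid ranges, the simplified model without within-study nesting or moderators, and the synthetic effect sizes, which are not data from the paper.

```python
import math


def log_normal_pdf(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def mu_posterior_on_grid(effects, variances, mu_prior_sd=1.0, tau_scale=0.5):
    """Marginal posterior of the pooled effect mu on a grid.

    Model (illustrative): y_i ~ Normal(mu, v_i + tau^2),
    mu ~ Normal(0, mu_prior_sd), tau ~ half-Cauchy(0, tau_scale).
    """
    mu_grid = [-1.0 + 0.01 * k for k in range(301)]    # -1.00 to 2.00
    tau_grid = [0.005 + 0.01 * k for k in range(150)]  # 0.005 to 1.495
    log_post = {}
    for mu in mu_grid:
        for tau in tau_grid:
            lp = log_normal_pdf(mu, 0.0, mu_prior_sd)
            lp -= math.log(1.0 + (tau / tau_scale) ** 2)  # half-Cauchy, up to a constant
            for y, v in zip(effects, variances):
                lp += log_normal_pdf(y, mu, math.sqrt(v + tau * tau))
            log_post[(mu, tau)] = lp
    peak = max(log_post.values())
    weights = {key: math.exp(val - peak) for key, val in log_post.items()}
    total = sum(weights.values())
    # Sum over tau to get the marginal posterior of mu
    probs = [sum(weights[(mu, tau)] for tau in tau_grid) / total for mu in mu_grid]
    return mu_grid, probs


def summarize(grid, probs, mass=0.95):
    """Posterior mean and central interval from a discrete grid."""
    mean = sum(x * p for x, p in zip(grid, probs))
    lo_cut, hi_cut = (1.0 - mass) / 2.0, 1.0 - (1.0 - mass) / 2.0
    cum, lower, upper, found_lower = 0.0, grid[0], grid[-1], False
    for x, p in zip(grid, probs):
        cum += p
        if not found_lower and cum >= lo_cut:
            lower, found_lower = x, True
        if cum >= hi_cut:
            upper = x
            break
    return mean, lower, upper


# Synthetic effect sizes and sampling variances (hypothetical, not from the paper).
effects = [0.10, 0.35, 0.55, 0.20, 0.70, 0.45]
variances = [0.04, 0.03, 0.06, 0.02, 0.08, 0.05]
grid, probs = mu_posterior_on_grid(effects, variances)
mean, lower, upper = summarize(grid, probs)
print(f"posterior mean={mean:.2f}, 95% interval=[{lower:.2f}, {upper:.2f}]")
```

Rerunning the sketch with different values of `mu_prior_sd` and `tau_scale` is a simple form of the prior sensitivity check that the referee asks for. A full analysis would use a dedicated sampler and the multilevel structure described in the paper.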

Circularity Check

0 steps flagged

No significant circularity; empirical synthesis of external studies

full rationale

The paper reports a standard Bayesian multilevel meta-regression fitted to 24 external primary studies (3 newly added). The headline effect g = 0.40 [0.14, 0.67] and moderator findings are direct model outputs from those independent data points; no equation, parameter, or claim reduces by construction to a fitted quantity defined from the same inputs, no self-citation chain bears the central result, and no ansatz or uniqueness theorem is smuggled in. Representativeness of the sample is a validity concern, not a circularity issue. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard meta-analytic assumptions about study representativeness and model adequacy rather than on new free parameters or invented entities.

axioms (2)

domain assumption: The 24 included studies constitute a representative sample of generative AI interventions for mathematics learning.
Invoked when interpreting the pooled effect and moderator results as generalizable.
standard math: The Bayesian multilevel meta-regression model correctly accounts for nesting and cumulative data structure.
Stated as the method used to produce the g = 0.40 estimate and credible interval.

pith-pipeline@v0.9.0 · 5715 in / 1359 out tokens · 46073 ms · 2026-05-25T07:24:49.479343+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses
cs.CY 2026-04 unverdicted novelty 5.0

Generative AI may break the education-based recovery mechanism for technological displacement, as evidence shows performance gains without learning gains and current measurements miss the knowledge dimension of cognition.
Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses
cs.CY 2026-04 unverdicted novelty 4.0

Generative AI risks eroding the developmental process of learning by performing high-level cognitive work, creating a paradox where it helps current workers but may undermine future capacity building, requiring new ou...

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics

LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics Version 2, 03/26 Anselm Strohmaier, Samira Bödefeld, Oliver Straser, Frank Reinhold University of Education Freiburg, Institute of Mathematics Education Abstract. The capabilities of generative AI in mathematics education are rapidly evolving, posing significant chall...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.18685 2026
[2]

dialogic partner

Changes to the previous version: Oliver Straser is now a co-author. This version includes 6 additional studies with 11 new effects. Analyses, results, figures and tables have been updated. Publication bias analyses with RoBMA have been added. Additional references have been added in the introduction. Changed the wording roles to purposes in the theoretica...

work page 2023
[3]

More recent syntheses illustrate how quickly evidence assessments become outdated as model capabilities evolve: For example, the scoping review by Pepin et al. (2025), published in February 2025 and based on studies available up to May 2024, discusses limitations in ChatGPT’s mathematical performance that have mostly been mitigated by subsequent model ver...

work page 2025
[4]

We propose a set of five categories that describe potential purposes through which generative AI may support students’ mathematical learning

and the meta-analysis by Wang and Fan (2025). We propose a set of five categories that describe potential purposes through which generative AI may support students’ mathematical learning. Generative AI as a mathematics expert. Generative AI systems can generate correct answers and complete solutions for a wide range of school-relevant mathematical tasks (e...

work page 2025
[5]

Another characteristic that might moderate the effectiveness of generative AI interventions is the underlying theory of learning guiding their design

and are likely to account for variability in observed effects across studies. Another characteristic that might moderate the effectiveness of generative AI interventions is the underlying theory of learning guiding their design. Across studies, generative AI may be embedded within different instructional paradigms, such as direct instruction, problem-based or...

work page 2014
[6]

Depending on the frequency of new publications and their influence on the overall effect and feasibility of moderator analyses, these intervals might be altered in the future

and an update of the publication at the alternating month (i.e., the next version is scheduled for May 2026). Depending on the frequency of new publications and their influence on the overall effect and feasibility of moderator analyses, these intervals might be altered in the future. Reports that had been excluded in previous versions might be included in ...

work page 2026
[7]

3.2 Literature Search The literature search for the current version of the living meta-analysis was conducted on February 2nd, 2026, using SCOPUS for documents and preprints

The study is planned to be retired from the living mode and published as a permanent version eventually, but as of now, there is no prespecified timeline. 3.2 Literature Search The literature search for the current version of the living meta-analysis was conducted on February 2nd, 2026, using SCOPUS for documents and preprints. The search targeted experime...

work page 2026
[8]

We included studies published in peer-reviewed journals, edited book chapters, and conference proceedings, as well as preprints that undergoing peer review at the time of screening

During screening, studies were included if they a) reported original data from an experimental or quasi-experimental intervention study, b) used generative AI in the intervention and no generative AI in the control group, c) involved human learners, d) reported mathematics performance as an outcome measure, and e) were written in English. We included stud...

work page 2000
[9]

[Figure residue: flow diagram boxes for studies included in the meta-analysis and new studies included, by version; counts not recoverable from the extraction.] Participant characteristics. Participant characteristics included learners’ educational level based on the International Standard Classification of Education (ISCED; Unesc...

work page 2012
[10]

This approach averages across models with and without publication-bias adjustments and quantifies evidence via Bayes factors

4.4 Publication bias Publication bias was assessed using the multilevel robust Bayesian model-averaged meta-analytic framework implemented in RoBMA (Bartoš & Maier, 2020; Bartoš, Maier, et al., 2025). This approach averages across models with and without publication-bias adjustments and quantifies evidence via Bayes factors. The inclusion Bayes factor for ...

work page 2020
[11]

Cumulative Bayesian meta-analysis over time. Study-level effect estimates (Hedges’ g) are shown as points at their publication dates, with point size proportional to the effective sampling precision of each study, accounting for within-study dependence. The smoothed line and shaded region indicate the posterior median and 95% credible interval of the pooled...

work page 2024
[12]

Our analysis shows a small positive average effect (g = 0.42) across 21 studies and 38 effect sizes

5 Discussion LLAMA LIMA provides an ongoing synthesis of intervention studies that use generative AI to support mathematics learning. Our analysis shows a small positive average effect (g = 0.42) across 21 studies and 38 effect sizes. Together with the wide credible intervals and substantial heterogeneity this suggests that generative AI-based interventions...

work page 2020
[13]

can be used, which indicates that the effect is, right now, relatively small. Regarding results not specific to mathematics, Wang and Fan (2025) reported a substantially higher mean effect of g = 0.87 of using ChatGPT on learning performance, but might be highly influenced by publication bias (Bartoš, Martinková, et al., 2025). Hattie’s hinge point (d = 0.40; Hattie,

work page 2025
[14]

However, it must be considered that this effect size typically stems directly from pre-post comparisons

might also be considered as a benchmark. However, it must be considered that this effect size typically stems directly from pre-post comparisons. In contrast, in our meta-analysis we determine effect sizes as differences in gain of the intervention group compared to a control group. The substantial heterogeneity of effects across studies indicates that the eff...

work page doi:10.1136/bmj-2024-079183 2025
[15]

https://doi.org/10.18637/jss.v080.i01 Canonigo, A. M. (2024). Levering AI to enhance students' conceptual understanding and confidence in mathematics. Journal of computer assisted learning, 40(6), 3215-3229. Cheng, L., Croteau, E., Baral, S., Heffernan, C., & Heffernan, N. (2024). Facilitating student learning with a chatbot in an online math learning platfo...

work page doi:10.18637/jss.v080.i01 2024
[16]

Y., Zhang, C., & Xu, Y

Liu, Y., Zha, S., Zhang, Y., Wang, Y., Zhang, Y., Xin, Q., Nie, L. Y., Zhang, C., & Xu, Y. (2025). BrickSmart: Leveraging Generative AI to Support Children's Spatial Language Learning in Family Block Play. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Ma, N., & Zhong, Z. (2025). A Meta-Analysis of the Impact of Generative A...

work page 2025
[17]

Ng, D. T. K., Chan, E. K. C., & Lo, C. K. (2025). Opportunities, challenges and school strategies for integrating generative AI in education. Computers and Education: Artificial Intelligence, 100373. OECD. (2006). Assessing Scientific, Reading and Mathematical Literacy: A Framework for PISA

work page 2025
[18]

https://doi.org/10.1787/9789264026407-en 14 Pardos, Z

PISA, OECD Publishing. https://doi.org/10.1787/9789264026407-en 14 Pardos, Z. A., & Bhandari, S. (2024). ChatGPT-generated help produces learning gains equivalent to human tutor-authored help on mathematics skills. PLoS ONE, 19(5), e0304013. Pepin, B., Buchholtz, N., & Salinas-Hernández, U. (2025). A Scoping Survey of ChatGPT in Mathematics Education. Dig...

work page doi:10.1787/9789264026407-en 2024
[19]

R., & Becker-Genschow, S

Rücker, C. R., & Becker-Genschow, S. (2025). Enhancing Enthusiasm for STEM Education with AI: Domain-Specific Chatbot as Personalized Learning Assistant. Computers and Education Open, 100315. https://doi.org/10.1016/j.caeo.2025.100315 Schneider, M., & Stern, E. (2010). The cognitive perspective on learning: Ten cornerstone findings. In O. f. E. C.-O. a. D. ...

work page doi:10.1016/j.caeo.2025.100315 2025
[20]

Utami, I

UNESCO. Utami, I. Q., Hwang, W.-Y., & Hariyanti, U. (2024). Contextualized and personalized math word problem generation in authentic contexts using generative pre-trained transformer and its influences on geometry learning. Journal of Educational Computing Research, 62(6), 1384-1419. https://doi.org/10.1177/07356331241249225 Viechtbauer, W. (2010). Conduc...

work page doi:10.1177/07356331241249225 2024
[21]

generative AI

https://doi.org/10.18637/jss.v036.i03 Wahba, F., Ajlouni, A. O., & Abumosa, M. A. (2024). The impact of ChatGPT-based learning statistics on undergraduates’ statistical reasoning and attitudes toward statistics. Eurasia Journal of Mathematics, Science and Technology Education, 20(7), em2468. Walkington, C. (2025). The implications of generative artificial ...

work page doi:10.18637/jss.v036.i03 2024
