Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build
Pith reviewed 2026-05-22 09:10 UTC · model grok-4.3
The pith
Generative AI reduces time spent on AI-susceptible math problems by up to 31 percent and lowers the odds of a correct answer on proctored retention items by 25 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After ChatGPT's release, time on AI-susceptible text problems declines 2.8 percent per quarter among college students, reaching a 26.9 percent cumulative drop over eleven quarters; high-school students show 31.3 percent, middle-school students 9.0 percent, and fifth-graders no detectable change. Under proctoring, the time divergence vanishes for college students. Logistic fixed-effects models on randomly assigned proctored retention items register a 25 percent cumulative decline in odds of correct response, while the identical estimator on non-proctored assessments shows a large increase.
What carries the argument
Quasi-experimental contrast between text-based word problems (transcribable into AI prompts) and graph-based problems (requiring live platform manipulation) within the same curriculum sequence.
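The logic of this contrast can be illustrated with a minimal two-period difference-in-differences sketch. It is not the paper's estimator (the paper reports a per-quarter decline rather than a single before/after gap), and everything in it is hypothetical: the synthetic data, the effect size of -0.30 log points, and the names `simulate`, `cell_mean`, and `did`.

```python
import math
import random
from statistics import fmean


def simulate(n_per_cell=2000, effect=-0.30, noise=0.5, seed=0):
    """Build synthetic records (is_text, is_post, log_minutes). All values are invented."""
    rng = random.Random(seed)
    rows = []
    for is_text in (0, 1):
        for is_post in (0, 1):
            for _ in range(n_per_cell):
                base = 1.5 + 0.2 * is_text          # text problems take longer at baseline
                common = -0.05 * is_post            # platform-wide shift that hits both types
                treat = effect * is_text * is_post  # shift only for text problems after release
                rows.append((is_text, is_post, base + common + treat + rng.gauss(0, noise)))
    return rows


def cell_mean(rows, is_text, is_post):
    """Mean log time for one (problem type, period) cell."""
    return fmean(y for t, p, y in rows if t == is_text and p == is_post)


def did(rows):
    """(text post - text pre) - (graph post - graph pre): the common shift cancels."""
    text_change = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)
    graph_change = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)
    return text_change - graph_change


estimate = did(simulate())
print(f"DiD on log time: {estimate:+.3f} (about {100 * (math.exp(estimate) - 1):+.1f}% change)")
```

The graph-based problems absorb anything that affects both problem types equally, which is why the identification depends on the two types sharing the same trend absent AI.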
If this is right
- Proctored assessments become necessary to measure actual knowledge rather than AI-assisted performance.
- Placement and progress tests that rely on unproctored results will overstate student readiness.
- Curriculum sequences built on cumulative mastery may need redesign if earlier topics are skipped via AI.
- Policy discussions about AI in education must weigh measurable losses in long-term retention against short-term efficiency gains.
Where Pith is reading between the lines
- Curricula could deliberately increase the share of interactive, non-transcribable tasks to limit offloading.
- Longer-term studies could test whether the retention gap closes when AI access is later restricted.
- Similar designs could be applied to other subjects to check whether the pattern is specific to mathematics problem solving.
Load-bearing premise
Any post-ChatGPT divergence between the two problem types is produced by AI use rather than unmeasured shifts in teaching, student habits, or platform design.
What would settle it
If a new cohort shows no decline in proctored retention odds on text-based items relative to graph-based items after the same time period, the claim of reduced durable learning from AI substitution would not hold.
Original abstract
How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of 3.2 million ALEKS learning interactions for the time-on-task analysis, complemented by ALEKS PPL placement-assessment data for the proctoring and retention analyses, with a quasi-experimental design exploiting within-curriculum variation in AI susceptibility: text-based word problems transcribable into AI prompts serve as the treated group; graph-based problems requiring interactive platform manipulation as the comparison. Learning time on AI-susceptible problems declines 2.8% per quarter among college students after ChatGPT's release, cumulating to 26.9% over eleven quarters; high-schoolers show 31.3%, middle-schoolers 9.0%, and Grade 5 students no detectable change. The divergence vanishes entirely under proctoring for college students, making general efficiency gains unlikely. Logistic fixed-effects models on randomly assigned proctored retention items yield a 25% cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of cognitive surrender, with direct implications for educational research, assessment governance, and AI policy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper uses a ten-year panel of 3.2 million ALEKS interactions and placement-assessment data in a quasi-experimental design that treats text-based word problems as AI-susceptible and graph-based problems as non-susceptible. It reports post-ChatGPT declines in time-on-task (2.8% per quarter for college students, cumulating to 26.9%) that vanish under proctoring, and a 25% cumulative decline in odds of correct response on randomly assigned proctored retention items via logistic fixed-effects models, while non-proctored assessment shows an opposite increase; the authors interpret this as evidence of cognitive surrender induced by generative AI.
Significance. If the identification holds, the study supplies large-scale behavioral and outcome evidence on how generative AI alters study time and durable knowledge in mathematics, with direct implications for assessment design, curriculum policy, and regulation of AI tools in education. The proctoring contrast and within-curriculum variation are strengths that help rule out some platform-wide confounds.
major comments (3)
- The identification rests on the assumption that text-based and graph-based problems would have followed parallel trends absent ChatGPT and that no other post-2022 shocks (curriculum changes, platform scoring updates, or differential engagement) affect the two groups differently. The manuscript does not report explicit pre-trend tests, placebo periods, or robustness checks that reclassify problems or interact with other time-varying covariates; without these, the 25% retention decline and time reductions cannot be cleanly attributed to AI use rather than correlated unobservables.
- The logistic fixed-effects models are described only at a high level in the abstract and results; the manuscript provides neither the exact specification (e.g., the form of the fixed effects, clustering, or handling of multiple observations per student), nor robustness tables showing sensitivity to alternative estimators or sample restrictions. This omission makes it impossible to evaluate whether the reported odds ratio is load-bearing or sensitive to modeling choices.
- The claim that the proctoring contrast rules out general efficiency gains is plausible but incomplete: the manuscript does not show whether the proctored subsample is representative of the full population or whether proctoring itself interacts with problem type in ways that could mechanically alter time or retention independent of AI.
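The pre-trend check requested in the first major comment can be sketched as follows: restrict to pre-release quarters, compute the text-minus-graph gap in mean log time per quarter, and fit a slope to that gap. This is only an illustration under stated assumptions. The data are synthetic, the built-in drift of 0.005 log points per quarter is invented, and `simulate_pre_period`, `quarterly_gap`, and `ols_slope` are hypothetical names, not the authors' code.

```python
import random
from statistics import fmean


def simulate_pre_period(n_quarters=16, n=500, drift=0.005, noise=0.5, seed=1):
    """Synthetic pre-release records (quarter, is_text, log_minutes). All values are invented."""
    rng = random.Random(seed)
    rows = []
    for q in range(n_quarters):
        for is_text in (0, 1):
            for _ in range(n):
                # drift * q * is_text is a differential pre-trend affecting text problems only
                y = 1.5 + 0.2 * is_text + drift * q * is_text + rng.gauss(0, noise)
                rows.append((q, is_text, y))
    return rows


def quarterly_gap(rows):
    """Return {quarter: mean log time on text problems minus mean on graph problems}."""
    gap = {}
    for q in sorted({quarter for quarter, _, _ in rows}):
        text = [y for quarter, t, y in rows if quarter == q and t == 1]
        graph = [y for quarter, t, y in rows if quarter == q and t == 0]
        gap[q] = fmean(text) - fmean(graph)
    return gap


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    xbar, ybar = fmean(xs), fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sxx


gap = quarterly_gap(simulate_pre_period())
slope = ols_slope(list(gap), list(gap.values()))
print(f"pre-period drift in the text-minus-graph gap: {slope:+.4f} log points per quarter")
```

A slope near zero supports parallel trends. A nonzero slope does not by itself invalidate the design, but its sign matters: drift in the same direction as the post-release effect would inflate the estimate, while drift in the opposite direction would make it conservative. A real analysis would also need standard errors clustered appropriately, which this sketch omits.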
minor comments (2)
- The abstract and results would benefit from a table or figure that directly displays the quarterly time-on-task coefficients by grade band and problem type, with confidence intervals and the exact number of observations per cell.
- Notation for the cumulative decline (26.9% over eleven quarters) should be tied explicitly to the quarterly rate and the functional form used (e.g., whether it is a linear trend or exponential).
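The functional-form question in the last minor comment can be checked with simple arithmetic. The sketch below compares three readings of "a rate per quarter over eleven quarters": a linear sum, compounding, and a log-linear ramp in which the cumulative effect is exp(11·β) − 1. The coefficients −0.0284 (college) and −0.0341 (high school) are the baseline per-quarter values quoted in the window-sensitivity table excerpt on this page; treating them as log-time ramp coefficients is an assumption, and the function names are hypothetical.

```python
import math

QUARTERS = 11  # post-release quarters covered by the cumulative figures


def cumulative_linear(rate, n=QUARTERS):
    """Percent change if the quarterly rate simply adds up."""
    return rate * n


def cumulative_compound(rate, n=QUARTERS):
    """Percent change if the quarterly rate compounds."""
    return (1 + rate) ** n - 1


def cumulative_log_ramp(beta, n=QUARTERS):
    """Percent change if beta is a per-quarter coefficient on log time."""
    return math.exp(beta * n) - 1


for label, beta in [("College", -0.0284), ("High school", -0.0341)]:
    print(
        f"{label}: linear {cumulative_linear(beta):.1%}, "
        f"compound {cumulative_compound(beta):.1%}, "
        f"log ramp {cumulative_log_ramp(beta):.1%}"
    )
```

Under these assumptions the log-ramp reading gives roughly 26.8% and 31.3%, close to the reported 26.9% and 31.3%, whereas the linear reading gives about 31.2% for college students. That suggests, but does not confirm, that the cumulative figures come from exponentiating a log-scale ramp.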
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below, providing clarifications on our identification strategy and committing to revisions that enhance transparency without altering the core findings.
- Referee: The identification rests on the assumption that text-based and graph-based problems would have followed parallel trends absent ChatGPT and that no other post-2022 shocks (curriculum changes, platform scoring updates, or differential engagement) affect the two groups differently. The manuscript does not report explicit pre-trend tests, placebo periods, or robustness checks that reclassify problems or interact with other time-varying covariates; without these, the 25% retention decline and time reductions cannot be cleanly attributed to AI use rather than correlated unobservables.
  Authors: We agree that explicit documentation of pre-trends would strengthen the parallel trends assumption. Our design already incorporates within-curriculum variation and the sharp proctoring contrast (time reductions vanish under proctoring, and the retention effect reverses sign in non-proctored settings) to address many alternative explanations such as platform-wide changes. Nevertheless, we will add formal pre-trend tests, placebo analyses on pre-ChatGPT periods, and robustness checks with alternative problem reclassifications and time-varying covariates in the revised manuscript.
  Revision: yes
- Referee: The logistic fixed-effects models are described only at a high level in the abstract and results; the manuscript provides neither the exact specification (e.g., the form of the fixed effects, clustering, or handling of multiple observations per student), nor robustness tables showing sensitivity to alternative estimators or sample restrictions. This omission makes it impossible to evaluate whether the reported odds ratio is load-bearing or sensitive to modeling choices.
  Authors: We acknowledge that the current manuscript describes the logistic fixed-effects models at a high level. The specification includes student fixed effects, quarter fixed effects, and problem fixed effects, with standard errors clustered at the student level to account for multiple observations per student. We will expand the methods section with the precise equation, variable definitions, and additional robustness tables (including alternative estimators and sample restrictions) in the revision.
  Revision: yes
- Referee: The claim that the proctoring contrast rules out general efficiency gains is plausible but incomplete: the manuscript does not show whether the proctored subsample is representative of the full population or whether proctoring itself interacts with problem type in ways that could mechanically alter time or retention independent of AI.
  Authors: The ALEKS PPL proctored assessments are randomly assigned, supporting representativeness, and the reversal of effects in non-proctored settings is inconsistent with mechanical proctoring interactions. We agree that explicit checks would be valuable and will add balance tables comparing proctored versus non-proctored samples on observables as well as tests for proctoring-by-problem-type interactions in the revised manuscript.
  Revision: yes
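One way to write down the specification described in the second response (student, quarter, and problem fixed effects) is sketched below. The notation is an assumption made for illustration, not the authors' equation: $y_{ipt}$ is a correct response by student $i$ on problem $p$ in quarter $t$, $S_p$ flags text-based (AI-susceptible) problems, $t_0$ is the release quarter, and the linear ramp $R_t$ is a guess at how a per-quarter effect would enter.

```latex
\[
\operatorname{logit}\Pr(y_{ipt}=1)
  \;=\; \alpha_i + \gamma_t + \delta_p + \beta\,\bigl(S_p \times R_t\bigr),
\qquad
R_t = \max(0,\; t - t_0)
\]
\[
\text{cumulative change in odds after } k \text{ quarters} \;=\; e^{\beta k} - 1
\]
```

In this form the main effect of $S_p$ is absorbed by the problem effects $\delta_p$ and the main effect of $R_t$ by the quarter effects $\gamma_t$, so only the interaction is identified. Under these assumptions, a 25% cumulative decline in odds over eleven quarters would correspond to $\beta \approx \ln(0.75)/11 \approx -0.026$ per quarter.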
Circularity Check
No significant circularity in quasi-experimental design
The paper's derivation relies on an external timing shock (ChatGPT release) and a within-curriculum classification of problems into AI-susceptible (text-based) versus non-susceptible (graph-based) groups, with logistic fixed-effects models applied to randomly assigned proctored retention items and a proctoring check that eliminates the divergence. No equations reduce reported declines or odds ratios to quantities defined by the same fitted parameters or outcomes; the analysis does not invoke self-citations for uniqueness, smuggle ansatzes, or rename known results as new derivations. The central estimates are therefore self-contained against external benchmarks such as the release date and proctoring status.
Axiom & Free-Parameter Ledger
free parameters (1)
- quarterly decline rate
axioms (1)
- domain assumption: text-based word problems can be directly transcribed into AI prompts, while graph-based problems cannot
Supplementary Information

This appendix reports the battery of ten robustness analyses (R1–R10) applied to each primary specification in the main text, along with two post-hoc sensitivity analyses (R9, R10) and supporting details on sample construction. The ten analyses are: (R1) a functional-form horse race comparing step...

The College and High School learning-time subsets exhibit a small positive pre-trend across all four placebo windows: AI-susceptible word problems were becoming relatively slower than AI-resistant graph problems in the pre-ChatGPT era. Because this drift is opposite in sign to the post-ChatGPT effect, a trend-adjusted specification would yield a larger neg...

Retention and proctored PPL subsets yield null placebos at every break date; College and High School learning-time subsets yield small positive placebos opposite in sign to the post-ChatGPT effect (see R4).

Subset                  2018         2019        2020       2021
LearningTime College    +0.0071***   +0.0063**   +0.0055*   +0.0048
LearningTime HS         +0.0046**    +0.0041**   +0.0036*   +0.0031
PPLTime non...

Delta-method and bootstrap CIs agree closely in every subset. The randomly assigned retention subsample produces a bootstrap CI that excludes the null (0.56–0.98).

Table 9: R6 Cumulative-effect 95% confidence intervals. Cumulative effects over eleven post-ChatGPT quarters, with both delta-method and cluster-bootstrap 95% CIs. For ...

Table 10: R7 Window-sensitivity cuts. Per-quarter ramp coefficient β under three window-sensitivity cuts. Estimates are stable in sign and magnitude across every cut in every subset.

Subset                  Baseline β   Drop COVID   Drop last quarter   Floor=100
LearningTime College    −0.0284      −0.0277      −0.0280             −0.0291
LearningTime HS         −0.0341      −0.0334      −0.0338             −0.0348
PPLTime nonproc         −0.0112      −0.010...