Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities
Pith reviewed 2026-05-10 14:37 UTC · model grok-4.3
The pith
AI tools upgrade low-demand math tasks with only 64 percent average success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When prompted to upgrade low-demand mathematics tasks, eleven AI tools produced accurate upgrades, as judged against the Task Analysis Guide, 64 percent of the time on average. Performance varied widely across tools (33 to 88 percent), specialized tools were only moderately ahead of general-purpose ones, and modification success showed a small negative correlation (r = -.35) with prior classification accuracy.
What carries the argument
The Task Analysis Guide framework, which sorts math tasks into four levels of cognitive demand from memorization to doing mathematics, used here as the target for AI-driven upgrades.
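The four-level ordering and the resulting outcome labels can be sketched in Python. This is a minimal illustration, not the paper's scoring procedure: the level names follow the Task Analysis Guide, the default target (Procedures with Connections) is the level the study prompts for, and `TagLevel` and `judge_upgrade` are hypothetical names introduced here.

```python
from enum import IntEnum


class TagLevel(IntEnum):
    """The four Task Analysis Guide levels, ordered by cognitive demand."""
    MEMORIZATION = 1
    PROCEDURES_WITHOUT_CONNECTIONS = 2
    PROCEDURES_WITH_CONNECTIONS = 3
    DOING_MATHEMATICS = 4


def judge_upgrade(
    modified: TagLevel,
    target: TagLevel = TagLevel.PROCEDURES_WITH_CONNECTIONS,
) -> str:
    """Label a modified task relative to the target level (hypothetical helper)."""
    if modified < target:
        return "undershoot"  # demand stayed below the target
    if modified > target:
        return "overshoot"   # demand went past the target
    return "meets"
```

A human rater still has to assign the level of the modified task; the sketch only encodes the comparison against the target.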
If this is right
- Teachers adapting materials with AI will still need to review outputs because roughly one-third of modifications fail to raise demand properly.
- The ability to modify tasks is a distinct capability from classifying them, so separate AI designs or training may be required for each.
- Both undershooting the target demand and overshooting into unrealistic levels occur frequently and must be addressed in future tools.
- Specialized education tools currently deliver only moderate gains over general-purpose AI for this specific teacher task.
Where Pith is reading between the lines
- Combining classification and modification steps inside one AI workflow could reduce the observed negative correlation and improve overall reliability.
- Testing the same tasks with highly optimized prompts would show how much the current 64 percent reflects typical use versus the best possible prompting.
- The same evaluation design could be applied to science or history tasks to test whether math presents unique challenges for AI modification.
Load-bearing premise
The prompting approach mirrors what knowledgeable teachers would actually do and the Task Analysis Guide alone is enough to judge whether an upgrade succeeded.
What would settle it
A follow-up study in which practicing teachers review the AI outputs for classroom fit, showing whether their acceptance rate matches the 64 percent figure.
Original abstract
While recent research has explored AI tools' ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested, including six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, coteach.ai). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy aimed to represent likely approaches taken by knowledgeable teachers, rather than extensive optimization to find a more effective prompt (i.e., an optimistic typical outcome). On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with different AI tool performance ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both "undershooting" (maintaining low cognitive demand) and "overshooting" (elevating tasks to an overly ambitious target category that likely would be rejected by teachers). Interestingly, there was a small negative correlation (r = -.35) between whether a given AI tool was able to correctly classify the cognitive demand of tasks and whether the AI was able to upgrade tasks, showing that the ability to modify tasks (i.e., a generative task) represents a distinct capability from the ability to classify them (i.e., judgement using a rubric). These findings have important implications for understanding AI's potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.
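The reported r relates two per-tool scores: classification accuracy and modification success. A minimal sketch of such a coefficient follows, assuming a Pearson correlation (the abstract does not name the statistic). `pearson_r` and its argument names are hypothetical, and no study data are included.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Called with one classification score and one modification score per tool, a negative result means tools that classify better tend to modify worse. With only eleven tools, a coefficient of this size carries wide uncertainty.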
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates the capacity of eleven AI tools (six general-purpose such as ChatGPT and Claude, and five specialized for math teachers) to upgrade low-cognitive-demand mathematics tasks. Using the Task Analysis Guide (TAG) framework of Stein & Smith (1998) and non-optimized prompts intended to reflect typical teacher practice, the study reports that tasks were accurately upgraded 64% of the time on average, with tool performance ranging from 33% to 88%. Specialized tools showed only moderate improvement over general-purpose ones. Failure modes included both undershooting (retaining low demand) and overshooting (elevating to overly ambitious categories), and a small negative correlation (r = -0.35) was observed between tools' task-classification accuracy and their modification success.
Significance. If the empirical results hold, the work supplies concrete evidence that generative task-modification performance is distinct from rubric-based classification performance in current AI systems. The moderate average success rate under realistic prompting conditions, together with the documented failure modes, offers a useful benchmark for researchers and developers working on AI support for mathematics curriculum adaptation and teacher workflow.
Major comments (2)
- [Abstract and Methods] The central success metric ('accurately upgraded only 64% of the time') is defined solely by whether AI outputs shift tasks into higher TAG categories. The manuscript explicitly treats overshooting as a failure mode that 'likely would be rejected by teachers,' yet reports no teacher validation, usability ratings, or classroom-acceptance checks on the modified tasks. This leaves the practical interpretation of the 64% figure dependent on an untested alignment between TAG judgments and practitioner standards.
- [Results] The reported average, range, and correlation (r = -.35) are presented without accompanying information on the total number of tasks evaluated, the number of trials or prompts per tool, the exact coding scheme for failure modes, or any measure of inter-rater reliability. These details are required to assess the statistical robustness and replicability of the headline claims.
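One common inter-rater reliability measure that could answer the second comment is Cohen's kappa, sketched below for two raters. This is an assumption for illustration: the report does not say which statistic the authors would use, and `cohens_kappa` and the label values are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label only")
    return (observed - expected) / (1 - expected)
```

The inputs would be each rater's independent labels (for example "meets", "undershoot", "overshoot") recorded before any consensus discussion.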
Minor comments (2)
- [Abstract] The abstract refers to 'two different types of low-demand mathematical tasks' without naming or exemplifying those types; a brief description or example in the abstract or early Methods section would improve clarity.
- Consider adding a supplementary table that lists success rates, failure-mode breakdowns, and classification accuracy for each of the eleven tools individually, rather than only aggregate statistics.
Simulated Author's Rebuttal
We appreciate the referee's detailed review and constructive suggestions for improving our manuscript. We address each of the major comments below and indicate the revisions we plan to make.
Point-by-point responses
-
Referee: [Abstract and Methods] The central success metric ('accurately upgraded only 64% of the time') is defined solely by whether AI outputs shift tasks into higher TAG categories. The manuscript explicitly treats overshooting as a failure mode that 'likely would be rejected by teachers,' yet reports no teacher validation, usability ratings, or classroom-acceptance checks on the modified tasks. This leaves the practical interpretation of the 64% figure dependent on an untested alignment between TAG judgments and practitioner standards.
Authors: We thank the referee for this observation. The primary metric in our study is based on the TAG framework to provide an objective, research-established measure of cognitive demand. We explicitly discuss overshooting as a potential issue for teachers but did not conduct teacher validation as part of this initial evaluation, focusing instead on AI performance relative to the TAG rubric. We acknowledge that this limits the direct claims about classroom practicality. In the revised manuscript, we will expand the Discussion to include this as a key limitation and propose future work involving teacher usability studies and acceptance checks. revision: yes
-
Referee: [Results] The reported average, range, and correlation (r = -.35) are presented without accompanying information on the total number of tasks evaluated, the number of trials or prompts per tool, the exact coding scheme for failure modes, or any measure of inter-rater reliability. These details are required to assess the statistical robustness and replicability of the headline claims.
Authors: We agree that these methodological details are essential for evaluating the robustness of our findings. Although some of this information is present in the Methods section, we will revise the manuscript to make it more explicit and comprehensive in both the Methods and Results sections. This will include specifying the total number of tasks, the number of trials per tool, the precise definitions and coding for failure modes (undershooting and overshooting), and the inter-rater reliability statistics for the TAG classifications. revision: yes
Circularity Check
No circularity: direct empirical measurements using external rubric
Full rationale
This is a purely empirical evaluation study with no derivations, equations, model-based predictions, or fitted parameters. Performance is measured by applying the pre-existing Stein & Smith (1998) Task Analysis Guide rubric to AI-generated task modifications, with success rates reported as direct counts (64% average). The prompting strategy is described as representative rather than optimized, and results are benchmarked against explicit failure modes without any self-referential logic or reduction of claims to inputs. The cited prior work on classification (arXiv:2603.03512) is external to the current measurements and does not bear the load of the modification results.
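Reporting success as direct counts amounts to a tally like the one below. It is a sketch under stated assumptions: the `(tool, outcome)` record layout, the outcome labels, and the name `success_rates` are hypothetical, and no data from the paper are reproduced.

```python
from collections import defaultdict


def success_rates(records):
    """Return (per-tool success rate, pooled success rate) from coded outcomes.

    `records` is an iterable of (tool, outcome) pairs, where outcome is
    "meets", "undershoot", or "overshoot"; only "meets" counts as success.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tool, outcome in records:
        totals[tool] += 1
        if outcome == "meets":
            hits[tool] += 1
    if not totals:
        raise ValueError("no records to tally")
    per_tool = {tool: hits[tool] / totals[tool] for tool in totals}
    pooled = sum(hits.values()) / sum(totals.values())
    return per_tool, pooled
```

When every tool is scored on the same number of tasks, the mean of the per-tool rates equals the pooled rate, so no fitted parameter enters the headline figure.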
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction 1.1 The Challenge of Curriculum Adaptation Teachers rarely serve their students well by implementing curricula exactly as written. Contextual factors like time constraints, student readiness, available resources, and pacing demands require constant adaptation of instructional materials (Remillard, 2005; Zhou & Lo, 2025). In mathematics educat...
-
[2]
and determine how to modify those elements while maintaining mathematical integrity and coherence with learning goals (Boston & Smith, 2009; Tekkumru-Kisa & Stein, 2015). This is time-intensive work that competes with teachers’ many other responsibilities. 1.2 AI as a Potential Solution Artificial intelligence tools have shown promise in various education...
-
[3]
Limited empirical evidence: Most studies of AI in education focus on student-facing intelligent tutoring (Hwang & Tu, 2021), or more generally, content generation, rather than strategic modification of existing materials
-
[4]
Unclear relationship to classification ability: AIs appear to vary substantially in their ability to classify the cognitive demand of given tasks (Fox et al., 2026). In humans, the ability to classify correctly is considered foundational to generating or modifying correctly. However, it is not clear that LLMs will also be limited in this way, given their ...
-
[5]
Insufficient analysis of failure modes: When AI tools fail to modify tasks successfully, we do not know whether certain error types are more common (e.g., undershooting the target demand level, overshooting the target demand level, or whether they make other types of errors, such as failing to address other classroom needs)
-
[6]
Lack of tool comparison: We do not know the extent of performance variation among currently available AI tools, particularly whether education-specific AI tools will outperform general-purpose AI tools for task modification. 1.4 Theoretical Framework This study employs the Task Analysis Guide (TAG) developed by Stein and Smith (1998), a research-backed fr...
-
[7]
Memorization: Tasks that require reproduction of previously learned facts, rules, formulas, or definitions without connections to concepts or meaning (e.g., What are the decimal and percent equivalents for the fractions ½ & ¼?)
-
[8]
Procedures without Connections: Tasks that focus on producing correct answers using procedures without developing conceptual understanding (e.g., Convert the fraction 3/8 to a decimal and a percent. Show your work.) High Cognitive Demand:
-
[9]
Procedures with Connections: Tasks that focus students’ attention on the use of procedures for the purpose of developing a deeper understanding of mathematical concepts and ideas (e.g., Using a 10 x 10 grid, identify the decimal and percent equivalents of 3/8.)
-
[10]
Doing Mathematics: Tasks that require complex, non-algorithmic thinking and demand self-monitoring of one’s own cognitive processes (e.g., Shade 6 of the small squares in the rectangle below. Using the diagram, explain how to determine each of the following: • the percent of the area that is shaded • the decimal part of the area that is shaded • the f...
-
[11]
Can current general and specialized education AI tools successfully modify low-cognitive-demand mathematics tasks to meet the criteria for Procedures with Connections?
-
[12]
How does modification success vary across different AI tools?
-
[13]
How does modification success vary across different types of low-demand tasks (Memorization vs. Procedures without Connections)?
-
[14]
What is the relation between an AI tool's ability to accurately classify cognitive demand and its ability to accurately modify tasks?
-
[15]
What patterns of failure (undershooting vs. overshooting) characterize unsuccessful modification attempts?
-
[16]
Methods 2.1 Building on Prior Classification Work 2.1.1 Methods This study builds directly on our prior investigation of AI tools’ ability to classify cognitive demand (Fox et al., 2026). That study tested the ability of 11 AI tools to classify mathematics tasks using the Task Analysis Guide. The current study uses the same 11 tools and focuses on six ta...
-
[17]
The Task Analysis Guide (TAG.docx)
-
[18]
The original task (Task X.docx)
-
[19]
A standardized modification prompt Standardized Prompt: "Based on the TAG.docx I uploaded, Task [X].docx is considered a lower cognitive demand task (Memorization/Procedures without Connections). Modify this task so that it meets the Procedures with Connections classification. Use the TAG.docx as the framework to modify the task. Give detailed reasoning a...
-
[20]
Use long division to convert each fraction to its decimal equivalent. Show your work and explain how the division process relates to the meaning of the fraction
-
[21]
Then, convert each decimal to a percent by explaining the connection between decimals and percentages
-
[22]
Represent each fraction visually (e.g., with a pie chart or number line) and explain how this representation supports your conversions. Magic School’s modification of Task M1 failed to meet the criteria for Procedures with Connections and instead created a task that aligns with the criteria for Procedures without Connections. In particular, the modified t...
-
[29]
Create your own comparison problem similar to this one and explain how someone could solve it using the insights you've developed. This modification is a clear overshoot by Coteach as it increased the cognitive demand level beyond the Procedures with Connections level into the Doing Mathematics classification. It requires students to access their prior le...
-
[30]
Discussion 4.1 Interpretation of Findings The 64% overall success rate presents a nuanced picture of AI capabilities for task modification. This performance level suggests that AI tools can meaningfully engage with the challenge of elevating cognitive demand, but they are not yet reliable enough for autonomous deployment without human oversight. The subst...
-
[31]
Conclusion This study provides the first systematic evaluation of AI tools' capacity to modify mathematics tasks to achieve higher cognitive demand. The 64% average success rate demonstrates that many current AI tools struggle with this pedagogical challenge, and the substantial variation across tools (33% – 83%) indicates that thoughtful tool selection i...
-
[32]
References Agyapong, B., Obuobi-Donkor, G., Burback, L., & Wei, Y. (2022). Stress, burnout, anxiety, and depression among teachers: A scoping review. International Journal of Environmental Research and Public Health, 19(17), 10706. https://doi.org/10.3390/ijerph191710706 Arena. (2026, April 1). Arena leaderboard: Compare & benchmark the best frontier AI ...
-
[33]
Compare the original task to the modified task to ensure that the goal of the task (e.g., comparing fractions), numbers, and operations were consistent. If the AI tool changed any of the core components during modification (e.g., the numbers used in the problems), then it was marked as a failure
-
[34]
Note specific lines of text within the adapted task (e.g., "Use long division to solve...") that related to the criteria for Procedures with Connections tasks (e.g., "Suggest pathways to follow (explicitly or implicitly) that are broad general procedures...")
-
[35]
Review the criteria for every cognitive demand level of the Task Analysis Guide. Note specific lines of the adapted task that connect to criteria across different levels of cognitive demand. For example, if the modified task is very cognitively demanding but does not provide an explicit or implicit solution pathway, then it aligns more with the first crit...
-
[36]
Final judgment is made to decide if the adapted task meets, exceeds, or undershoots the goal of Procedures with Connections. After each reviewer had independently evaluated all their assigned modifications, the reviewers met to compare notes and evaluations of all 66 tasks and made a final determination together. If disagreement occurred among reviewers, ...
-
[37]
Without calculating decimal values, predict which class has completed a greater fraction of their garden project. Explain your reasoning using what you know about fractions
-
[38]
Create visual models (such as grids or bar diagrams) to represent both fractions. What do you notice when you compare them visually? Part 2: Mathematical Analysis
-
[39]
Use at least two different mathematical methods to determine whether 4/100 < 4/96. For each method, explain: Why the method works mathematically, how it connects to the meaning of fractions, what it reveals about the relationship between numerators and denominators
-
[40]
Convert both fractions to equivalent fractions with the same denominator. Explain how this method helps you understand why one fraction is larger than the other. Part 3: Pattern Recognition and Generalization
-
[41]
Consider the general case: If you have two fractions with the same numerator (like a/b and a/c), how can you determine which is larger without doing any calculations? Test your rule with other examples
-
[42]
Explain why your rule works by connecting it to the real-world garden context. What does it mean for the denominator to be larger when the numerator stays the same?
-
[43]
Create your own comparison problem similar to this one and explain how someone could solve it using the insights you've developed