Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities

Brenda L. Robles; Christian D. Schunn; Danielle S. Fox; Elizabeth DiPietro Brovey

arxiv: 2604.12743 · v1 · submitted 2026-04-14 · 💻 cs.AI

Pith reviewed 2026-05-10 14:37 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI in education · mathematics tasks · cognitive demand · task modification · curriculum adaptation · generative AI · teacher support tools

The pith

AI tools upgrade low-demand math tasks with only 64 percent average success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether AI systems can take simple, low-cognitive-demand mathematics tasks and raise their demand level so students engage in more reasoning. Eleven tools received prompts written to reflect typical use by knowledgeable teachers, then the outputs were scored against the Task Analysis Guide categories. Average success reached 64 percent, but individual tools ranged from 33 to 88 percent, and tools specialized for math teachers performed only modestly better than general-purpose ones. The work also reports that skill at classifying demand levels correlates negatively with skill at actually changing tasks. These patterns indicate AI can assist curriculum adaptation yet still requires human oversight for reliable results.

Core claim

When prompted to upgrade low-demand mathematics tasks, eleven AI tools achieved accurate upgrades 64 percent of the time on average according to the Task Analysis Guide, with performance varying widely, specialized tools only slightly ahead of general ones, and modification success showing a negative correlation with prior classification accuracy.

What carries the argument

The Task Analysis Guide framework, which sorts math tasks into four levels of cognitive demand from memorization to doing mathematics, used here as the target for AI-driven upgrades.
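The ordering of the four levels, and how an upgrade attempt relates to its target, can be sketched in a few lines of Python. This is only an illustration: the names `TagLevel`, `TARGET`, and `judge_outcome` are hypothetical, and the level assigned to a modified task is assumed to come from a human rater applying the guide, not from code. The study also failed modifications for other reasons, such as changed numbers, which this sketch does not cover.

```python
from enum import IntEnum


class TagLevel(IntEnum):
    """The four Task Analysis Guide categories, ordered by cognitive demand."""
    MEMORIZATION = 1
    PROCEDURES_WITHOUT_CONNECTIONS = 2
    PROCEDURES_WITH_CONNECTIONS = 3
    DOING_MATHEMATICS = 4


# The study's prompt asked for this category specifically.
TARGET = TagLevel.PROCEDURES_WITH_CONNECTIONS


def judge_outcome(rated_level: TagLevel, target: TagLevel = TARGET) -> str:
    """Compare the level a rater assigned to a modified task with the target."""
    if rated_level < target:
        return "undershoot"  # demand stayed below the target
    if rated_level > target:
        return "overshoot"   # demand went past the target
    return "success"


if __name__ == "__main__":
    for level in TagLevel:
        print(f"{level.name}: {judge_outcome(level)}")
```

Using an integer-backed enum keeps the comparison explicit: anything below the target counts as undershooting and anything above it as overshooting.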

If this is right

Teachers adapting materials with AI will still need to review outputs because roughly one-third of modifications fail to raise demand properly.
The ability to modify tasks is a distinct capability from classifying them, so separate AI designs or training may be required for each.
Both undershooting the target demand and overshooting into unrealistic levels occur frequently and must be addressed in future tools.
Specialized education tools currently deliver only small gains over general AI for this specific teacher task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Combining classification and modification steps inside one AI workflow could reduce the observed negative correlation and improve overall reliability.
Testing the same tasks with highly optimized prompts would show how much the current 64 percent reflects typical use versus the best possible prompting.
The same evaluation design could be applied to science or history tasks to test whether math presents unique challenges for AI modification.

Load-bearing premise

The prompting approach mirrors what knowledgeable teachers would actually do and the Task Analysis Guide alone is enough to judge whether an upgrade succeeded.

What would settle it

A follow-up where practicing teachers review the AI outputs for classroom fit and either accept or reject them at rates matching the 64 percent figure.

Original abstract

While recent research has explored AI tools' ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested, including six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, coteach.ai). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy aimed to represent likely approaches taken by knowledgeable teachers, rather than extensive optimization to find a more effective prompt (i.e., an optimistic typical outcome). On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with different AI tool performance ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both "undershooting" (maintaining low cognitive demand) and "overshooting" (elevating tasks to an overly ambitious target category that likely would be rejected by teachers). Interestingly, there was a small negative correlation (r = -.35) between whether a given AI tool was able to correctly classify the cognitive demand of tasks and whether the AI was able to upgrade tasks, showing that the ability to modify tasks (i.e., a generative task) represents a distinct capability from the ability to classify them (i.e., judgement using a rubric). These findings have important implications for understanding AI's potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.
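For readers who want to see how a coefficient like the reported r = -.35 is obtained, the following minimal sketch computes a Pearson correlation between two per-tool scores. The function name `pearson_r` and both data lists are hypothetical: the numbers are invented for illustration and are not the study's data, and the abstract does not say exactly how the authors computed their coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL per-tool scores, invented for illustration only.
classification_accuracy = [0.9, 0.8, 0.7, 0.6, 0.5]
modification_success = [0.5, 0.67, 0.5, 0.83, 0.67]

if __name__ == "__main__":
    r = pearson_r(classification_accuracy, modification_success)
    print(f"r = {r:.2f}")
```

A negative value means that tools scoring higher on one measure tend to score lower on the other. With only a handful of tools, a coefficient of this kind is sensitive to individual data points.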

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AI tools upgrade low-demand math tasks only 64% of the time on average, with a negative correlation to their classification accuracy and no teacher validation of the results.

The main thing to know is that this paper tests eleven AI tools on upgrading low-cognitive-demand math tasks and finds moderate success at best. Average accurate upgrades hit 64%, with tools ranging from 33% to 88%. Specialized math tools did only a bit better than general ones like ChatGPT. The authors also report a negative correlation (r = -.35) between a tool's classification performance and its modification performance, which suggests these are distinct capabilities rather than the same skill applied twice.

Referee Report

2 major / 2 minor

Summary. The paper evaluates the capacity of eleven AI tools (six general-purpose, such as ChatGPT and Claude, and five specialized for math teachers) to upgrade low-cognitive-demand mathematics tasks. Using the Task Analysis Guide (TAG) framework of Stein & Smith (1998) and non-optimized prompts intended to reflect typical teacher practice, the study reports that tasks were accurately upgraded 64% of the time on average, with tool performance ranging from 33% to 88%. Specialized tools showed only moderate improvement over general-purpose ones. Failure modes included both undershooting (retaining low demand) and overshooting (elevating to overly ambitious categories), and a small negative correlation (r = -.35) was observed between tools' task-classification accuracy and their modification success.

Significance. If the empirical results hold, the work supplies concrete evidence that generative task-modification performance is distinct from rubric-based classification performance in current AI systems. The moderate average success rate under realistic prompting conditions, together with the documented failure modes, offers a useful benchmark for researchers and developers working on AI support for mathematics curriculum adaptation and teacher workflow.

major comments (2)

[Abstract and Methods] The central success metric ('accurately upgraded only 64% of the time') is defined solely by whether AI outputs shift tasks into higher TAG categories. The manuscript explicitly treats overshooting as a failure mode that 'likely would be rejected by teachers,' yet reports no teacher validation, usability ratings, or classroom-acceptance checks on the modified tasks. This leaves the practical interpretation of the 64% figure dependent on an untested alignment between TAG judgments and practitioner standards.
[Results] The reported average, range, and correlation (r = -.35) are presented without accompanying information on the total number of tasks evaluated, the number of trials or prompts per tool, the exact coding scheme for failure modes, or any measure of inter-rater reliability. These details are required to assess the statistical robustness and replicability of the headline claims.
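The referee does not name a particular reliability statistic. One common choice for two raters assigning categorical labels is Cohen's kappa, sketched below as an assumption about what such a measure could look like here. The function name `cohens_kappa` and the labels in the usage example are hypothetical and are not drawn from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # HYPOTHETICAL ratings, invented for illustration only.
    a = ["success", "undershoot", "overshoot", "success"]
    b = ["success", "undershoot", "success", "success"]
    print(f"kappa = {cohens_kappa(a, b):.2f}")
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is usually preferred over a simple percent-agreement figure.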

minor comments (2)

[Abstract] The abstract refers to 'two different types of low-demand mathematical tasks' without naming or exemplifying those types; a brief description or example in the abstract or early Methods section would improve clarity.
Consider adding a supplementary table that lists success rates, failure-mode breakdowns, and classification accuracy for each of the eleven tools individually, rather than only aggregate statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and constructive suggestions for improving our manuscript. We address each of the major comments below and indicate the revisions we plan to make.

Referee: [Abstract and Methods] The central success metric ('accurately upgraded only 64% of the time') is defined solely by whether AI outputs shift tasks into higher TAG categories. The manuscript explicitly treats overshooting as a failure mode that 'likely would be rejected by teachers,' yet reports no teacher validation, usability ratings, or classroom-acceptance checks on the modified tasks. This leaves the practical interpretation of the 64% figure dependent on an untested alignment between TAG judgments and practitioner standards.

Authors: We thank the referee for this observation. The primary metric in our study is based on the TAG framework to provide an objective, research-established measure of cognitive demand. We explicitly discuss overshooting as a potential issue for teachers but did not conduct teacher validation as part of this initial evaluation, focusing instead on AI performance relative to the TAG rubric. We acknowledge that this limits the direct claims about classroom practicality. In the revised manuscript, we will expand the Discussion to include this as a key limitation and propose future work involving teacher usability studies and acceptance checks. revision: yes
Referee: [Results] The reported average, range, and correlation (r = -.35) are presented without accompanying information on the total number of tasks evaluated, the number of trials or prompts per tool, the exact coding scheme for failure modes, or any measure of inter-rater reliability. These details are required to assess the statistical robustness and replicability of the headline claims.

Authors: We agree that these methodological details are essential for evaluating the robustness of our findings. Although some of this information is present in the Methods section, we will revise the manuscript to make it more explicit and comprehensive in both the Methods and Results sections. This will include specifying the total number of tasks, the number of trials per tool, the precise definitions and coding for failure modes (undershooting and overshooting), and the inter-rater reliability statistics for the TAG classifications. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements using external rubric

This is a purely empirical evaluation study with no derivations, equations, model-based predictions, or fitted parameters. Performance is measured by applying the pre-existing Stein & Smith (1998) Task Analysis Guide rubric to AI-generated task modifications, with success rates reported as direct counts (64% average). The prompting strategy is described as representative rather than optimized, and results are benchmarked against explicit failure modes without any self-referential logic or reduction of claims to inputs. The cited prior work on classification (arXiv:2603.03512) is external to the current measurements and does not bear the load of the modification results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study relying on the established Task Analysis Guide framework and standard statistical measures (percentages, correlation) without introducing new mathematical axioms, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5641 in / 1288 out tokens · 44738 ms · 2026-05-10T14:37:23.707021+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

[1]

Introduction 1.1 The Challenge of Curriculum Adaptation Teachers rarely serve their students well by implementing curricula exactly as written. Contextual factors like time constraints, student readiness, available resources, and pacing demands require constant adaptation of instructional materials (Remillard, 2005; Zhou & Lo, 2025). In mathematics educat...

[2]

This is time-intensive work that competes with teachers’ many other responsibilities

and determine how to modify those elements while maintaining mathematical integrity and coherence with learning goals (Boston & Smith, 2009; Tekkumru-Kisa & Stein, 2015). This is time-intensive work that competes with teachers’ many other responsibilities. 1.2 AI as a Potential Solution Artificial intelligence tools have shown promise in various education...

[3]

Limited empirical evidence: Most studies of AI in education focus on student-facing intelligent tutoring (Hwang & Tu, 2021), or more generally, content generation, rather than strategic modification of existing materials

[4]

In humans, the ability to classify correctly is considered foundational to generating or modifying correctly

Unclear relationship to classification ability: AIs appear to vary substantially in their ability to classify the cognitive demand of given tasks (Fox et al., 2026). In humans, the ability to classify correctly is considered foundational to generating or modifying correctly. However, it is not clear that LLMs will also be limited in this way, given their ...

[5]

Insufficient analysis of failure modes: When AI tools fail to modify tasks successfully, we do not know whether certain error types are more common (e.g., undershooting the target demand level, overshooting the target demand level, or whether they make other types of errors, such as failing to address other classroom needs)

[6]

Lack of tool comparison: We do not know the extent of performance variation among currently available AI tools, particularly whether education-specific AI tools will outperform general-purpose AI tools for task modification. 1.4 Theoretical Framework This study employs the Task Analysis Guide (TAG) developed by Stein and Smith (1998), a research-backed fr...

[7]

Memorization: Tasks that require reproduction of previously learned facts, rules, formulas, or definitions without connections to concepts or meaning (e.g., What are the decimal and percent equivalents for the fractions ½ & ¼?)

[8]

Show your work.) High Cognitive Demand:

Procedures without Connections: Tasks that focus on producing correct answers using procedures without developing conceptual understanding (e.g., Convert the fraction 3/8 to a decimal and a percent. Show your work.) High Cognitive Demand:

[9]

Procedures with Connections: Tasks that focus students’ attention on the use of procedures for the purpose of developing a deeper understanding of mathematical concepts and ideas (e.g., Using a 10 x 10 grid, identify the decimal and percent equivalents of 3/8.)

[10]

Doing Mathematics: Tasks that require complex, non-algorithmic thinking and demand self-monitoring of one’s own cognitive processes (e.g., Shade 6 of the small squares in the rectangle below. Using the diagram, explain how to determine each of the following: • the percent of the area that is shaded • the decimal part of the area that is shaded • the f...

[11]

Can current general and specialized education AI tools successfully modify low-cognitive-demand mathematics tasks to meet the criteria for Procedures with Connections?

[12]

How does modification success vary across different AI tools?

[13]

Procedures without Connections)?

How does modification success vary across different types of low-demand tasks (Memorization vs. Procedures without Connections)?

[14]

What is the relation between an AI tool's ability to accurately classify cognitive demand and its ability to accurately modify tasks?

[15]

overshooting) characterize unsuccessful modification attempts?

What patterns of failure (undershooting vs. overshooting) characterize unsuccessful modification attempts?

[16]

That study tested the ability of 11 AI tools to classify mathematics tasks using the Task Analysis Guide

Methods 2.1 Building on Prior Classification Work 2.1.1 Methods This study builds directly on our prior investigation of AI tools’ ability to classify cognitive demand (Fox et al., 2026). That study tested the ability of 11 AI tools to classify mathematics tasks using the Task Analysis Guide. The current study uses the same 11 tools and focuses on six ta...

[17]

The Task Analysis Guide (TAG.docx)

[18]

The original task (Task X.docx)

[19]

Modify this task so that it meets the Procedures with Connections classification

A standardized modification prompt Standardized Prompt: "Based on the TAG.docx I uploaded, Task [X].docx is considered a lower cognitive demand task (Memorization/Procedures without Connections). Modify this task so that it meets the Procedures with Connections classification. Use the TAG.docx as the framework to modify the task. Give detailed reasoning a...
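The standardized prompt can be treated as a template with two slots: the task file and the task's current low-demand category. The sketch below reproduces only the sentences that are fully visible in the excerpt above and leaves out the remainder, which is truncated here. The names `PROMPT_TEMPLATE`, `LOW_DEMAND_CATEGORIES`, and `build_prompt` are hypothetical, and reading the parenthetical as "one of the two categories" is an assumption.

```python
# Only the fully visible sentences of the prompt are reproduced; the excerpt
# is truncated after them, so the rest is intentionally left out.
PROMPT_TEMPLATE = (
    "Based on the TAG.docx I uploaded, {task_file} is considered a lower "
    "cognitive demand task ({low_category}). Modify this task so that it meets "
    "the Procedures with Connections classification. Use the TAG.docx as the "
    "framework to modify the task."
)

# ASSUMPTION: the parenthetical names whichever low-demand category applies.
LOW_DEMAND_CATEGORIES = ("Memorization", "Procedures without Connections")


def build_prompt(task_label: str, low_category: str) -> str:
    """Fill the template for one task, e.g. task_label='M1'."""
    if low_category not in LOW_DEMAND_CATEGORIES:
        raise ValueError(f"not a low-demand category: {low_category!r}")
    return PROMPT_TEMPLATE.format(
        task_file=f"Task {task_label}.docx",
        low_category=low_category,
    )


if __name__ == "__main__":
    print(build_prompt("M1", "Memorization"))
```

Keeping the wording fixed and varying only these two slots is what makes results comparable across tools and tasks.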

[20]

Show your work and explain how the division process relates to the meaning of the fraction

Use long division to convert each fraction to its decimal equivalent. Show your work and explain how the division process relates to the meaning of the fraction

[21]

Then, convert each decimal to a percent by explaining the connection between decimals and percentages

[22]

window dressing

Represent each fraction visually (e.g., with a pie chart or number line) and explain how this representation supports your conversions. Magic School’s modification of Task M1 failed to meet the criteria for Procedures with Connections and instead created a task that aligns with the criteria for Procedures without Connections. In particular, the modified t...

[29]

This modification is a clear overshoot by Coteach as it increased the cognitive demand level beyond the Procedures with Connections level into the Doing Mathematics classification

Create your own comparison problem similar to this one and explain how someone could solve it using the insights you've developed. This modification is a clear overshoot by Coteach as it increased the cognitive demand level beyond the Procedures with Connections level into the Doing Mathematics classification. It requires students to access their prior le...

[30]

Discussion 4.1 Interpretation of Findings The 64% overall success rate presents a nuanced picture of AI capabilities for task modification. This performance level suggests that AI tools can meaningfully engage with the challenge of elevating cognitive demand, but they are not yet reliable enough for autonomous deployment without human oversight. The subst...

[31]

Conclusion This study provides the first systematic evaluation of AI tools' capacity to modify mathematics tasks to achieve higher cognitive demand. The 64% average success rate demonstrates that many current AI tools struggle with this pedagogical challenge, and the substantial variation across tools (33% – 83%) indicates that thoughtful tool selection i...

[32]

References Agyapong, B., Obuobi-Donkor, G., Burback, L., & Wei, Y. (2022). Stress, burnout, anxiety, and depression among teachers: A scoping review. International Journal of Environmental Research and Public Health, 19(17), 10706. https://doi.org/10.3390/ijerph191710706 Arena. (2026, April 1). Arena leaderboard: Compare & benchmark the best frontier AI ...

[33]

If the AI tool changed any of the core components during modification (e.g., the numbers used in the problems), then it was marked as a failure

Compare the original task to the modified task to ensure that the goal of the task (e.g., comparing fractions), numbers, and operations were consistent. If the AI tool changed any of the core components during modification (e.g., the numbers used in the problems), then it was marked as a failure
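The reviewers made this comparison by hand. As a rough illustration of the "numbers" part of the check only, the sketch below flags numeric tokens that appear in the original task but not in the modified one. It is an assumption about how a pre-screen could be automated, not the authors' procedure; the names `NUMBER_PATTERN`, `extract_numbers`, and `missing_numbers` are hypothetical, and the sample strings borrow the 3/8 wording from the TAG level examples purely as test text.

```python
import re

# Matches integers, decimals, and simple fractions such as 3/8 or 4/100.
# Unicode fraction characters like ½ are NOT handled by this sketch.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?(?:/\d+)?")


def extract_numbers(text: str) -> set:
    """Return the set of numeric tokens found in a task statement."""
    return set(NUMBER_PATTERN.findall(text))


def missing_numbers(original: str, modified: str) -> set:
    """Numeric tokens present in the original task but absent after modification."""
    return extract_numbers(original) - extract_numbers(modified)


if __name__ == "__main__":
    original = "Convert the fraction 3/8 to a decimal and a percent."
    kept = "Using a 10 x 10 grid, identify the decimal and percent equivalents of 3/8."
    changed = "Using a 10 x 10 grid, identify the decimal and percent equivalents of 5/8."
    print(missing_numbers(original, kept))
    print(missing_numbers(original, changed))
```

A non-empty result would only prompt a closer look. Whether the task's goal and operations were preserved still needs a human judgment.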

[34]

Use long division to solve

Note specific lines of text within the adapted task (e.g., “Use long division to solve...”) that relate to the criteria for Procedures with Connections tasks (e.g., “Suggest pathways to follow (explicitly or implicitly) that are broad general procedures...”)

[35]

Note specific lines of the adapted task that connect to criteria across different levels of cognitive demand

Review the criteria for every cognitive demand level of the Task Analysis Guide. Note specific lines of the adapted task that connect to criteria across different levels of cognitive demand. For example, if the modified task is very cognitively demanding but does not provide an explicit or implicit solution pathway, then it aligns more with the first crit...

[36]

how many times bigger

Final judgment is made to decide if the adapted task meets, exceeds, or undershoots the goal of Procedures with Connections. After each reviewer had independently evaluated all their assigned modifications, the reviewers met to compare notes and evaluations of all 66 tasks and made a final determination together. If disagreement occurred among reviewers, ...

[37]

Explain your reasoning using what you know about fractions

Without calculating decimal values, predict which class has completed a greater fraction of their garden project. Explain your reasoning using what you know about fractions

[38]

What do you notice when you compare them visually? Part 2: Mathematical Analysis

Create visual models (such as grids or bar diagrams) to represent both fractions. What do you notice when you compare them visually? Part 2: Mathematical Analysis

[39]

Use at least two different mathematical methods to determine whether 4/100 < 4/96. For each method, explain: Why the method works mathematically, how it connects to the meaning of fractions, what it reveals about the relationship between numerators and denominators

[40]

Explain how this method helps you understand why one fraction is larger than the other

Convert both fractions to equivalent fractions with the same denominator. Explain how this method helps you understand why one fraction is larger than the other. Part 3: Pattern Recognition and Generalization

[41]

Consider the general case: If you have two fractions with the same numerator (like a/b and a/c), how can you determine which is larger without doing any calculations? Test your rule with other examples

[42]

What does it mean for the denominator to be larger when the numerator stays the same?

Explain why your rule works by connecting it to the real-world garden context. What does it mean for the denominator to be larger when the numerator stays the same?

[43]

Create your own comparison problem similar to this one and explain how someone could solve it using the insights you've developed

