Should There be a Teacher In-the-Loop? A Study of Generative AI Personalized Tasks Middle School

Andrew Lan; Candace Walkington; Itffini Pruitt-Britton; Mingyu Feng; Theodora Beauchamp

arxiv: 2602.15876 · v1 · submitted 2026-02-02 · 💻 cs.CY · cs.AI

Candace Walkington, Mingyu Feng, Itffini Pruitt-Britton, Theodora Beauchamp, Andrew Lan

Pith reviewed 2026-05-16 07:51 UTC · model grok-4.3

keywords: generative AI, personalized learning, teacher in the loop, middle school mathematics, ChatGPT, student interests, context personalization, efficiency

The pith

When a teacher is in the loop, generative AI personalization of middle school math tasks happens at a broader grain size than students prefer; students favor specific popular culture references.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how middle school math teachers partner with ChatGPT to create personalized problems based on student interests. It finds that teacher involvement leads to personalization at a broad grain size, while students prefer smaller grain sizes with detailed popular culture references. Teachers invested effort in adjusting references and fixing problem depth or realism, varying how much ownership they gave the AI. Although teachers improved at crafting interesting problems over time, the process did not become notably more time-efficient. This highlights a potential mismatch in how personalization is enacted with versus without direct teacher oversight.

Core claim

When teachers use generative AI like ChatGPT to personalize curriculum tasks for their 7th-grade math students, the resulting problems are personalized at a relatively broad grain size. In contrast, students tend to prefer a smaller grain size involving specific popular culture references that interest them. Teachers spent considerable effort adjusting popular culture references and addressing issues with the depth or realism of generated problems, sometimes giving higher or lower levels of ownership to the AI. Teachers showed improvement in crafting interesting problems through this partnership, but the process did not appear to become particularly time efficient even as they learned from their students' data.

What carries the argument

The teacher-in-the-loop prompting and modification process with ChatGPT for generating personalized math problems, which determines the grain size of interest-based adaptations.

If this is right

Personalization remains at a broad level, potentially reducing student engagement compared to finer-grained references.
Teachers must expend effort to correct AI-generated issues with realism and depth.
Improvement in problem quality occurs with practice, but time savings do not materialize.
Students receive tasks that may not fully match their interest in specific cultural references.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Hybrid systems could allow AI to generate fine-grained options for teacher review to balance specificity and accuracy.
Without a teacher, generative AI might produce more student-preferred specific references but risk introducing errors or inappropriate content.
Future designs might incorporate student feedback loops to adjust grain size dynamically.

Load-bearing premise

The observed differences in grain size and efficiency are caused mainly by the teacher being in the loop, not by the math topics, student backgrounds, or the specific AI model version.

What would settle it

A controlled comparison where teachers personalize the same tasks without any AI input and produce similarly broad grain sizes, or where AI alone produces fine-grained popular culture references matching student preferences, would test whether the teacher loop is the driver.

Original abstract

Adapting instruction to the fine-grained needs of individual students is a powerful application of recent advances in large language models. These generative AI models can create tasks that correspond to students' interests and enact context personalization, enhancing students' interest in learning academic content. However, when there is a teacher in-the-loop creating or modifying tasks with generative AI, it is unclear how efficient this process might be, despite commercial generative AI tools' claims that they will save teachers time. In the present study, we teamed 7 middle school mathematics teachers with ChatGPT to create personalized versions of problems in their curriculum, to correspond to their students' interests. We look at the prompting moves teachers made, their efficiency when creating problems, and the reactions of their 521 7th grade students who received the personalized assignments. We find that having a teacher-in-the-loop results in generative AI-enhanced personalization being enacted at a relatively broad grain size, whereas students tend to prefer a smaller grain size where they receive specific popular culture references that interest them. Teachers spent a lot of effort adjusting popular culture references and addressing issues with the depth or realism of the problems generated, giving higher or lower levels of ownership to the generative AI. Teachers were able to improve in their ability to craft interesting problems in partnership with generative AI, but this process did not appear to become particularly time efficient as teachers learned and reflected on their students' data, iterating their approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Teacher-AI personalization in 7th-grade math yields broader grain sizes than students prefer, but without a no-teacher control arm the causal role of the teacher remains unclear.

The main thing to know is that when seven middle school math teachers worked with ChatGPT to personalize curriculum problems for their 521 students, the outputs landed at a relatively broad grain size while students reacted more positively to finer-grained versions with specific popular culture references. The process also stayed time-intensive even as teachers got better at prompting and iterating based on student data.

What the paper actually contributes is a set of concrete observations from real classrooms: the prompting moves teachers used, how they adjusted for ownership and realism, and the efficiency patterns across multiple rounds. That kind of grounded description of teacher-AI interaction is still thin in the literature, so the empirical record here is useful even if it does not introduce new theory. The study does a reasonable job documenting that teachers spent effort on references and depth fixes, and that efficiency gains were limited.

The clearest limitation is the absence of a parallel condition with unedited ChatGPT outputs on the same problems. All the grain-size and preference data come from teacher-modified tasks, so it is hard to tell whether the broader grain size is mainly the teacher's doing or tied to the topics, the student demographics, or the model version itself. Student reactions are also measured only against the versions they received, which leaves open the possibility that the mismatch is post-hoc dissatisfaction rather than an independent preference for smaller grain sizes. The abstract is light on methods details such as coding reliability or statistical tests, though the full paper presumably fills some of that in.

This work is aimed at researchers studying human-AI collaboration in K-12 settings and at tool designers who want to understand how teachers actually use these systems. Readers working on classroom personalization or teacher training will get practical value from the prompting examples and efficiency notes.

It has enough real data and a clear practical question to deserve a serious referee, though revisions should tighten the methods section and address the control-condition gap if follow-up data can be added.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical study in which seven middle school mathematics teachers collaborated with ChatGPT to generate personalized versions of curriculum problems for 521 seventh-grade students, aligned with students' reported interests. The authors examine teachers' prompting strategies and ownership decisions, the time required to produce usable tasks, and students' reactions to the resulting assignments. They conclude that the presence of a teacher-in-the-loop produces personalization at a relatively broad grain size, while students express preference for finer-grained personalization that incorporates specific popular-culture references; teachers invested substantial effort refining AI outputs for realism and depth, demonstrated improvement across iterations, yet the process did not become markedly more time-efficient.

Significance. If the reported patterns are robust, the work supplies classroom-grounded evidence on the practical costs and benefits of human oversight in generative-AI personalization. The documented tension between teacher-driven breadth and student preference for specificity, together with the efficiency data, offers concrete guidance for the design of AI tools intended for K-12 settings and contributes to the literature on human-AI collaboration in education.

major comments (2)

[Methods] The study design contains no control condition in which students receive unedited ChatGPT outputs on the same curriculum items. All prompting moves, efficiency metrics, and student reactions (n=521) derive exclusively from teacher-modified tasks; consequently the claim that teacher-in-the-loop presence produces broader grain size cannot be isolated from curriculum-topic effects, student demographics, or model version. (Methods and Results sections)
[Results] The manuscript provides no information on data-collection protocols, statistical tests employed for efficiency or preference analyses, or inter-rater reliability for the qualitative coding of prompting moves and student reactions. These omissions are load-bearing for the central empirical claims about grain-size mismatch and efficiency trajectories. (Results section)

minor comments (1)

[Abstract] The abstract would benefit from explicit statement of the number of teachers, students, and problems generated, as well as a concise quantitative summary of the efficiency and preference findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and reporting of our study on teacher-AI collaboration for personalization. We address each major point below, noting revisions where appropriate.

Referee: [Methods] The study design contains no control condition in which students receive unedited ChatGPT outputs on the same curriculum items. All prompting moves, efficiency metrics, and student reactions (n=521) derive exclusively from teacher-modified tasks; consequently the claim that teacher-in-the-loop presence produces broader grain size cannot be isolated from curriculum-topic effects, student demographics, or model version.

Authors: We agree that the study lacks a no-teacher control condition, as its design focused exclusively on documenting the teacher-in-the-loop workflow, prompting strategies, and resulting task characteristics. The observed broader grain size reflects patterns in the teacher-edited outputs (e.g., teachers broadening popular-culture references to maintain curriculum alignment and realism). We will revise the abstract, introduction, and results to frame this as a descriptive finding from the collaborative process rather than a causal claim about the 'presence' of the teacher-in-the-loop. This is a genuine limitation of the current design; adding a control would require a new study.

revision: partial

Referee: [Results] The manuscript provides no information on data-collection protocols, statistical tests employed for efficiency or preference analyses, or inter-rater reliability for the qualitative coding of prompting moves and student reactions. These omissions are load-bearing for the central empirical claims about grain-size mismatch and efficiency trajectories.

Authors: We appreciate this feedback on reporting clarity. The Methods section describes the overall data collection (teacher logs, student surveys, and qualitative coding of prompts and reactions), but we will expand the Results section to explicitly state the statistical tests applied to efficiency metrics (e.g., time-on-task trends) and preference analyses, along with inter-rater reliability coefficients for the qualitative codes. These details will be added in the revision to strengthen the empirical claims.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical observations

This is a purely observational empirical study of teacher-ChatGPT collaboration on 7th-grade math personalization. It reports prompting moves, time spent, problem adjustments, and student preference data (n=521) without any equations, model derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction. The grain-size mismatch finding is an interpretive summary of observed outputs and reactions rather than a tautological redefinition or statistical forcing. No load-bearing uniqueness theorems or ansatzes are invoked.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical education study with no mathematical models, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5577 in / 1070 out tokens · 50607 ms · 2026-05-16T07:51:56.420698+00:00 · methodology
