
arxiv: 2604.27433 · v1 · submitted 2026-04-30 · 💻 cs.HC

Beyond One-Size-Fits-All Exercises: Personalizing Computer Science Worksheets with Large Language Models

Pith reviewed 2026-05-07 10:27 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM personalization · CS1 education · adaptive learning · learner profiles · scaffolding · regular expressions · completion rates · motivation

The pith

LLM-personalized computer science worksheets achieve near-universal completion and higher correctness for struggling students.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how large language models can generate personalized worksheets for an introductory computer programming course. Students were grouped into four profiles based on their knowledge and motivation levels using established educational theories. The study compared these adaptive materials against standard exercises on a regular expressions topic with over 400 participants. Personalized versions maintained completion rates above 99 percent for all groups, while standard exercises saw 25 to 30 percent of low-knowledge students fail to finish. Low knowledge and low motivation students also scored 18 percent higher on correctness with the tailored support, and all students viewed the tasks as similarly challenging.

Core claim

The authors demonstrate that tailoring instructional materials to learner profiles—derived from Bloom's Taxonomy for knowledge and Self-Determination Theory for motivation—using LLMs results in dramatically improved task completion and performance for at-risk students in CS1. Specifically, the personalized exercises served primarily to retain students who would otherwise abandon the work, without reducing the desirable level of difficulty as measured by student perceptions.

What carries the argument

Learner profiles that determine variations in scaffolding, explicitness, and tone for LLM-generated exercises, grounded in Bloom's Taxonomy and Self-Determination Theory.
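As a rough illustration of how a two-axis profile could drive those variations, the sketch below crosses a knowledge score with a motivation score to produce four profiles and looks up generation settings for each. The 0.5 cut-offs, the score scale, and every setting value are hypothetical placeholders; this summary does not report the paper's actual thresholds or prompt parameters.

```python
from dataclasses import dataclass

# Hypothetical cut-offs on normalized 0-1 scores (not taken from the paper).
KNOWLEDGE_CUTOFF = 0.5
MOTIVATION_CUTOFF = 0.5


@dataclass(frozen=True)
class Adaptation:
    """Settings a worksheet generator might vary per profile (illustrative only)."""
    scaffolding: str
    explicitness: str
    tone: str


# Illustrative mapping from (knowledge, motivation) level to generation settings.
ADAPTATIONS = {
    ("low", "low"): Adaptation("heavy", "step-by-step", "encouraging"),
    ("low", "high"): Adaptation("heavy", "step-by-step", "neutral"),
    ("high", "low"): Adaptation("light", "concise", "encouraging"),
    ("high", "high"): Adaptation("light", "concise", "neutral"),
}


def classify(knowledge: float, motivation: float) -> tuple[str, str]:
    """Place a student in one of four profiles from two pre-test scores."""
    k = "high" if knowledge >= KNOWLEDGE_CUTOFF else "low"
    m = "high" if motivation >= MOTIVATION_CUTOFF else "low"
    return (k, m)


def adaptation_for(knowledge: float, motivation: float) -> Adaptation:
    """Look up the (hypothetical) generation settings for a student."""
    return ADAPTATIONS[classify(knowledge, motivation)]
```

A call such as `adaptation_for(0.2, 0.3)` would select the Low Knowledge/Low Motivation settings, which could then be passed to a prompt template.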

If this is right

  • Standard exercises lead to high incompletion rates among low-knowledge learners.
  • Personalized support boosts correctness specifically for low-knowledge, low-motivation profiles.
  • Students focus on structural elements like logical sequence and pacing rather than motivational tone.
  • The adaptive approach preserves the challenge level of the tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This personalization could help reduce dropout rates in introductory CS courses more broadly.
  • Instructors might use such systems to handle large classes with diverse student needs efficiently.
  • Future work could test the approach on different topics or with automated profile assessment.

Load-bearing premise

The assumption that learner profiles can be accurately identified beforehand and that the large language model will consistently generate appropriate scaffolding and explicitness without introducing errors or biases.

What would settle it

If a follow-up experiment randomly assigns profiles or uses non-personalized outputs and still observes the same high completion rates, the claim that personalization drives the retention effect would be falsified.

Figures

Figures reproduced from arXiv: 2604.27433 by Franco Ortiz, Michael Liut, Runlong Ye.

Figure 1: Study workflow: 1. Pre-Test Profiling, 2. Personal…
Figure 2: Task Completion by Learner Profile and Experiment
Figure 3: Student Performance by Learner Profile and Exper…
Original abstract

While Large Language Models (LLMs) have been widely applied to student-facing educational tools, this work explores their use in supporting instructors by presenting a practical adaptation of the Framework for Adaptive Content using Educational Technology (FACET) system to generate personalized instructional materials for an Introduction to Computer Programming (CS1) course. We conducted a mixed-methods study with 409 first-year computer science (CS) students, focusing on regular expressions (RegEx). Students were assessed on their knowledge and motivation, classified into one of four learner profiles, and assigned either LLM-personalized (treatment) or standard non-adaptive (control) exercises. Personalized materials varied in scaffolding, instructional explicitness, and tone based on learner profiles grounded in Bloom's Taxonomy and Self-Determination Theory. Quantitative analysis reveals that standard exercises resulted in task incompletion among low-knowledge learners, with approximately 25-30% incompletion, whereas personalized materials sustained near-universal completion (>99%) across all profiles. While high-performing students experienced ceiling effects, Low Knowledge/Low Motivation students achieved significantly higher correctness (+18.2%) with personalized support. Survey data indicate that students prioritize structural scaffolding (logical sequence, difficulty pacing) over motivational tone and perceive the adaptive tasks as equally challenging as standard exercises. These findings suggest that learner-profile-driven LLM personalization primarily serves as a retention scaffold, preventing task abandonment among at-risk students without diminishing the task's "desirable difficulty". The results demonstrate that instructor-facing LLM systems can effectively close engagement gaps in CS1 by tailoring instructional explicitness to student needs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper describes a mixed-methods study with 409 CS1 students using LLMs to generate personalized regular expressions exercises based on four learner profiles derived from Bloom's Taxonomy and Self-Determination Theory. Students were assigned to personalized (treatment) or standard (control) conditions; quantitative results claim near-universal completion (>99%) with personalization versus 25-30% incompletion for low-knowledge students in the control, plus an +18.2% correctness gain for the Low Knowledge/Low Motivation profile, while survey data indicate students value structural scaffolding and perceive equivalent challenge.

Significance. If the central claims hold after addressing validation gaps, the work would demonstrate a practical, instructor-facing application of LLMs to close engagement gaps in introductory CS courses by adapting scaffolding and explicitness to learner profiles. This could inform scalable retention strategies without reducing desirable difficulty, with the mixed-methods design and focus on at-risk students as notable strengths.

major comments (3)
  1. [Abstract / Results] Abstract and Results section: The headline quantitative outcomes (completion rates >99%, +18.2% correctness) are presented without statistical details such as p-values, confidence intervals, error bars, exact per-profile sample sizes, or tests for confounds like self-selection; this is load-bearing for the claim that personalization drives retention and performance gains rather than non-specific factors.
  2. [Methods] Methods section: No validation is reported for the pre-study assessment instrument used to classify students into the four Bloom/SDT-derived profiles (e.g., test-retest reliability, inter-rater agreement, or predictive validity), which is essential to ensure effects are attributable to accurate profile assignment rather than misclassification.
  3. [Methods] Methods / LLM generation subsection: The manuscript contains no post-generation audit or quantitative check of LLM outputs for fidelity to the intended profile-specific scaffolding, explicitness, tone, or factual correctness in RegEx content; without this, observed benefits could arise from generic quality improvements instead of the targeted adaptation mechanism.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement of the exact classification procedure and any handling of edge cases in profile assignment.
  2. [Results] Figure or table captions for completion/correctness data should explicitly note the statistical tests applied and any multiple-comparison corrections.
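To make major comment 1 concrete, here is a minimal sketch of the kind of reporting it asks for: a two-proportion z-test with a 95% Wald confidence interval for the difference in completion rates between conditions. The counts used in the usage note are invented for illustration, because the per-profile sample sizes are not given in this summary. The choice of test is also an assumption and not the authors' stated analysis.

```python
import math


def two_proportion_test(x1, n1, x2, n2, z_crit=1.96):
    """Compare two completion rates.

    x1/n1 and x2/n2 are completed/total counts for each condition.
    Returns (difference, z statistic, two-sided p-value, 95% Wald CI).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Pooled standard error for the null hypothesis of equal proportions.
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error for the confidence interval on the difference.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return diff, z, p_value, ci
```

For example, `two_proportion_test(99, 100, 72, 100)` would compare a hypothetical 99% completion rate in the treatment group against 72% in the control group. With proportions this close to the ceiling, the normal approximation can be unreliable, so an exact test such as Fisher's may be the safer choice in practice.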

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the methodological transparency of our work. We address each major comment below and commit to revisions that enhance the rigor of the statistical reporting, instrument description, and LLM output validation without altering the core findings or design.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results section: The headline quantitative outcomes (completion rates >99%, +18.2% correctness) are presented without statistical details such as p-values, confidence intervals, error bars, exact per-profile sample sizes, or tests for confounds like self-selection; this is load-bearing for the claim that personalization drives retention and performance gains rather than non-specific factors.

    Authors: We agree that the abstract and results would benefit from expanded statistical detail to support the claims. The full results section already reports per-profile sample sizes (derived from the 409 participants) and notes the +18.2% correctness difference for the Low Knowledge/Low Motivation profile as significant, but we will revise both the abstract and results to explicitly include p-values, 95% confidence intervals, and error bars for key comparisons. Assignment to treatment/control occurred after profile classification via pre-assessment, with no self-selection into conditions; we will add explicit discussion of this randomization and any tested confounds (e.g., prior programming experience). These additions will be incorporated in the revised manuscript.

    revision: yes

  2. Referee: [Methods] Methods section: No validation is reported for the pre-study assessment instrument used to classify students into the four Bloom/SDT-derived profiles (e.g., test-retest reliability, inter-rater agreement, or predictive validity), which is essential to ensure effects are attributable to accurate profile assignment rather than misclassification.

    Authors: The assessment instrument was constructed by adapting validated items from established Bloom's Taxonomy and Self-Determination Theory scales in the literature, with classification thresholds set a priori based on prior CS1 studies. However, we did not report test-retest reliability or formal predictive validity testing within this study. In the revision, we will expand the Methods section to detail the item sources, pilot testing with a small CS1 cohort for face validity, and any internal consistency metrics (e.g., Cronbach's alpha where applicable). We will also explicitly note the absence of full predictive validity as a limitation and discuss how profile assignments aligned with observed performance patterns in the data.

    revision: partial

  3. Referee: [Methods] Methods / LLM generation subsection: The manuscript contains no post-generation audit or quantitative check of LLM outputs for fidelity to the intended profile-specific scaffolding, explicitness, tone, or factual correctness in RegEx content; without this, observed benefits could arise from generic quality improvements instead of the targeted adaptation mechanism.

    Authors: We acknowledge that a systematic post-generation audit strengthens attribution to the profile-specific adaptations. During development, prompts were iteratively engineered to enforce profile-based variations in scaffolding and explicitness, and a subset of outputs was manually inspected for RegEx accuracy and alignment. We will add a dedicated subsection in Methods describing the prompt templates, generation parameters, and results of a quantitative audit on a random sample of 100 generated exercises (e.g., percentage matching intended scaffolding levels, factual correctness rates, and inter-rater agreement on tone/explicitness). This audit will be performed and reported in the revision.

    revision: yes
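The inter-rater agreement promised in response 3 is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The sketch below is one way such an audit statistic could be computed. The label names are hypothetical, and the rebuttal does not say which agreement measure the authors intend to use.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    rater_a and rater_b are equal-length sequences of category labels,
    e.g. the scaffolding level each rater judged an exercise to have.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Here a value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. The audit would apply the function to the two raters' labels for the sampled exercises, once per rated dimension (for instance tone and explicitness).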

Circularity Check

0 steps flagged

No circularity: empirical study with direct measurements

full rationale

The paper reports results from a mixed-methods experiment with 409 students who were assessed, profiled using Bloom's Taxonomy and Self-Determination Theory, assigned to LLM-generated personalized or standard RegEx exercises, and measured on completion (>99% vs 25-30%) and correctness (+18.2% for one profile). No equations, derivations, fitted parameters, or self-citation chains appear in the provided text; the central claims are grounded in observed outcomes rather than any reduction of results to author-defined inputs by construction. This is a standard empirical evaluation without the self-definitional or prediction-by-fit patterns that would trigger circularity flags.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper draws on two established educational theories and standard mixed-methods evaluation practices; no new free parameters, axioms, or invented entities are introduced beyond the application of LLMs to an existing adaptive-content framework.

axioms (1)
  • domain assumption: Learner profiles derived from Bloom's Taxonomy and Self-Determination Theory can be reliably measured and used to guide instructional adaptation.
    Invoked when classifying students and generating profile-specific materials.

pith-pipeline@v0.9.0 · 5586 in / 1264 out tokens · 44340 ms · 2026-05-07T10:27:06.489800+00:00 · methodology

