arxiv: 2604.02677 · v1 · submitted 2026-04-03 · 💻 cs.HC · cs.CY

Beyond the AI Tutor: Social Learning with LLM Agents

Harsh Kumar, Zi Kang (Jace) Mu, Jonathan Vincentius, Ashton Anderson

Pith reviewed 2026-05-13 19:14 UTC · model grok-4.3


keywords: LLM agents, social learning, AI tutoring, multi-agent systems, educational technology, learning outcomes, essay writing, problem solving


The pith

Combining an LLM tutor with LLM peers improves unassisted learning outcomes beyond what a single tutor provides.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether multi-agent LLM setups can deliver the collaborative benefits long shown in human learning research, such as peer modeling and exposure to varied perspectives. Most current AI education tools stick to one-on-one tutoring, but two experiments examine what happens when students also interact with LLM peers that make different kinds of errors. In math problem solving, the tutor-plus-peers group scored highest on a later test with no AI help. In essay writing, only the two-agent setup kept idea diversity from collapsing the way single-LLM assistance did.

Core claim

Participants who worked with both an LLM tutor and LLM peers reached the highest accuracy on unassisted SAT-style math problems, while in argumentative and creative writing tasks only the condition with two distinct LLMs avoided the reduction in idea-level variety produced by single-model assistance.

What carries the argument

Multi-agent LLM configurations that add peer agents making distinct conceptual or arithmetic errors alongside a tutor agent.

If this is right

In convergent problem-solving tasks, adding LLM peers to a tutor produces the largest post-interaction accuracy lift.
In divergent writing tasks, two-agent setups maintain broader idea distributions where single-agent setups do not.
Design of AI learning tools can move from dyadic tutoring toward configurations that simulate observational and co-constructive benefits.
Error diversity across agents appears to support the observed advantages in both domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Classroom-scale deployments might combine several specialized agents to approximate group discussion without increasing human teacher load.
Future designs could test whether the same multi-agent pattern improves outcomes in domains such as coding or scientific reasoning.
If the pattern holds, platforms may need new interfaces that let learners choose which agents to consult rather than defaulting to a single model.

Load-bearing premise

The measured gains come specifically from multi-party social-learning processes rather than from simply receiving more total AI output, particular error patterns, or laboratory demand effects.

What would settle it

A replication that matches total AI exposure time across conditions but removes the peer-interaction element and still finds equivalent gains would falsify the claim that multi-party mechanisms are responsible.

Figures

Figures reproduced from arXiv: 2604.02677 by Ashton Anderson, Harsh Kumar, Jonathan Vincentius, Zi Kang (Jace) Mu.

**Figure 1.** Experimental procedure and conditions for Experiment-1. Participants first go through a random topic-selection step

**Figure 2.** Example lesson-phase interactions from Experiment-1. Left: In the Peers Only condition, Alice (arithmetic errors) and

**Figure 3.** Test accuracy by lesson support condition in

**Figure 4.** Post-study perceptions in Experiment-1. Panels show (left to right) perceived difficulty (1=

**Figure 5.** Post-survey perceptions of each agent across four qualities in Experiment-1. Participants rated agents they interacted

**Figure 6.** Example writing-phase interactions from Experiment 2. Left: In the Single condition, a participant asks ChatGPT for

**Figure 7.** Primary outcomes of Experiment-2. (a) Both LLM conditions improved essay quality over Control, with no significant

**Figure 8.** Post-study perceptions in Experiment-2 (collaborative writing). Panels show (left to right) independent writing

**Figure 9.** Perceptions of the writing support agents by LLM conditions. Participants rated the agent(s) on competence, warmth,

Original abstract

Most AI-based educational tools today adopt a one-on-one tutoring paradigm, pairing a single LLM with a single learner. Yet decades of learning science research suggest that multi-party interaction -- through peer modeling, co-construction, and exposure to diverse perspectives -- can produce learning benefits that dyadic tutoring alone cannot. In this paper, we investigate whether multi-agent LLM configurations can enhance learning outcomes beyond what a single LLM tutor provides. We present two controlled experiments spanning distinct learning contexts. In a convergent problem-solving study (N=315), participants tackle SAT-level math problems in a 2×2 design that varies the presence of an LLM tutor and LLM peers, each making different kinds of errors (conceptual vs. arithmetic); participants who interacted with both a tutor and peers achieved the highest unassisted test accuracy. In a divergent composition study (N=247), participants write argumentative and creative essays with either no AI assistance, a single LLM (Claude or ChatGPT), or both Claude and ChatGPT together; while both LLM conditions improved essay quality, only the two-agent condition avoided the idea-level homogeneity that single-model assistance was found to produce. Together, these studies offer one of the first controlled investigations of multi-agent LLM learning environments, probing whether the move from one-on-one AI tutoring toward richer agent configurations can unlock the collaborative and observational benefits long documented in human social learning research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper runs two experiments showing multi-agent LLM setups beat single-tutor baselines on math accuracy and essay diversity, but the designs mix social mechanisms with extra total AI exposure so the mechanism claims stay underdetermined.


The punchline is that participants got the best unassisted math scores when both a tutor and peers were present, and essays stayed more diverse when two different LLMs were used instead of one. Those are the directional results from the N=315 and N=247 studies.

The work is new in running a clean 2x2 contrast for convergent problem-solving and a dual-model condition for divergent writing, both framed against long-standing findings from human collaborative learning. It does a solid job setting up real tasks, recruiting decent sample sizes, and showing that single-model assistance can produce homogeneity while the two-agent version does not. The authors also pick error types that differ across agents, which is a reasonable way to test diversity.

The main soft spot is the confound the stress-test flags: the both-present arm supplies more total LLM outputs, turns, and error instances than the single-factor arms, so any gain could come from volume or coverage rather than peer modeling or co-construction. The abstract gives no statistical tests, effect sizes, exclusion rules, or interaction details, which leaves the central claims hard to evaluate without the full methods. The divergent study has the same issue when it contrasts one model versus two without equating total generated content.

This paper is for HCI and AI-education researchers who want early evidence on moving past dyadic tutors. A reader looking for practical design ideas or a starting point for follow-up experiments will get value from the conditions and the tie to social-learning literature. It deserves a serious referee because the questions are timely, the setups are controlled enough to be worth refining, and the directional patterns are worth checking with tighter controls. I would send it to review with requests for full stats, exposure-matched conditions, and explicit discussion of alternative explanations.

Referee Report

3 major / 2 minor

Summary. The paper claims that multi-agent LLM configurations can enhance learning beyond single-tutor setups. In a convergent 2x2 experiment (N=315) on SAT math problems, participants with both an LLM tutor and LLM peers (making distinct conceptual vs. arithmetic errors) achieved the highest unassisted test accuracy. In a divergent study (N=247) on argumentative and creative essays, both single- and dual-LLM conditions improved quality over no assistance, but only the two-agent condition (Claude + ChatGPT) avoided the idea-level homogeneity observed with single models.
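A note on what the 2x2 design can and cannot show, written as a sketch in our own notation (the symbols below are assumptions for illustration, not taken from the paper). Let $\mu_{tp}$ be mean unassisted test accuracy in the cell where the tutor is present ($t=1$) or absent ($t=0$) and peers are present ($p=1$) or absent ($p=0$). The two main effects and the interaction contrast can then be written as:

```latex
\begin{align*}
\Delta_{\text{tutor}} &= \tfrac{1}{2}\bigl[(\mu_{11}-\mu_{01}) + (\mu_{10}-\mu_{00})\bigr] \\
\Delta_{\text{peers}} &= \tfrac{1}{2}\bigl[(\mu_{11}-\mu_{10}) + (\mu_{01}-\mu_{00})\bigr] \\
\Delta_{\text{int}}   &= (\mu_{11}-\mu_{10}) - (\mu_{01}-\mu_{00})
\end{align*}
```

The reported pattern, $\mu_{11}$ being the largest cell mean, is compatible with $\Delta_{\text{int}} = 0$: two purely additive positive main effects already make the both-present cell the highest. A claim that tutor and peers combine to more than the sum of their separate contributions would need $\Delta_{\text{int}} > 0$, which is why the interaction term matters for the reporting requests below.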

Significance. If the central claims survive controls for exposure volume and proper statistical reporting, the work supplies one of the first controlled empirical tests of multi-party LLM learning environments. It directly links decades of social-learning research (peer modeling, co-construction, perspective diversity) to concrete agent configurations, offering a falsifiable path from dyadic tutoring to richer multi-agent setups.

major comments (3)

[Convergent problem-solving study methods] Convergent study methods (2x2 design): the both-present arm necessarily supplies more total LLM outputs, dialogue turns, and error instances than the single-factor arms. Because the abstract already notes that peers produce distinct error types, any accuracy gain could arise from cumulative exposure or error coverage rather than from social mechanisms such as peer modeling or co-construction. No equating of total generated content across cells is described.
[Results] Results reporting (both studies): the abstract and summary state directional outcomes (highest accuracy in the both-present condition; homogeneity avoided only in the two-agent condition) but supply no statistical tests, effect sizes, confidence intervals, or exclusion criteria. Without these, the load-bearing claims cannot be evaluated for reliability or practical significance.
[Divergent composition study methods] Divergent study design: the single-LLM vs. two-LLM contrast likewise does not equate total generated tokens or interaction volume. The homogeneity finding is therefore underdetermined with respect to whether it stems from model diversity or simply from receiving two independent generations.
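To make the homogeneity concern concrete, here is a minimal sketch of one way idea-level homogeneity could be scored per condition: the mean pairwise Jaccard similarity between the sets of ideas coded in each essay. This is an assumed stand-in, since the paper's actual metric is not described on this page; the function names and the toy idea labels are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two idea sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(idea_sets: list) -> float:
    """Average Jaccard similarity over all essay pairs in one condition.

    Higher values mean essays share more ideas, i.e. more homogeneity.
    """
    pairs = list(combinations(idea_sets, 2))
    if not pairs:
        raise ValueError("need at least two essays to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy, invented idea codes purely for illustration (not study data).
single_model = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
two_models = [{"a", "b"}, {"c", "d"}, {"a", "d"}]

print(round(mean_pairwise_similarity(single_model), 3))
print(round(mean_pairwise_similarity(two_models), 3))
```

A score like this depends on how many distinct generations each participant saw, which is the point of the third major comment: comparing conditions on it is only informative about model diversity if interaction volume is held comparable.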

minor comments (2)

[Methods] Clarify whether multi-party interactions are synchronous (real-time multi-agent chat) or sequential single-agent turns; this distinction is central to the social-learning interpretation.
[Methods] Add explicit power analysis or justification for the chosen sample sizes (N=315, N=247) given the 2x2 and three-arm designs.
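As a rough illustration of the power question in the last minor comment, the sketch below estimates the smallest standardized difference (Cohen's d) that a two-sided comparison of two cells could detect at 80% power, using the normal approximation. It assumes equal allocation across cells (about 315/4 and 247/3 participants per cell), which the abstract does not confirm, and it ignores exclusions; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n_per_group: float, alpha: float = 0.05,
                     power: float = 0.80) -> float:
    """Smallest Cohen's d detectable when comparing two equal-sized groups.

    Normal approximation: d = (z_{1-alpha/2} + z_{power}) * sqrt(2 / n).
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2.0 / n_per_group)


# Assumption: participants were split evenly across cells.
d_math = min_detectable_d(315 / 4)   # 2x2 design, four cells
d_essay = min_detectable_d(247 / 3)  # three-arm design

print(f"math study, cell vs. cell:  d >= {d_math:.2f}")
print(f"essay study, arm vs. arm:   d >= {d_essay:.2f}")
```

Under these assumptions, pairwise cell contrasts in either study are only well powered for roughly medium-sized effects, so a stated justification for the sample sizes would help readers judge null or marginal contrasts.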

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the interpretability of our findings on multi-agent LLM learning environments. We address each major comment below and indicate revisions incorporated into the updated manuscript.


Referee: [Convergent problem-solving study methods] Convergent study methods (2x2 design): the both-present arm necessarily supplies more total LLM outputs, dialogue turns, and error instances than the single-factor arms. Because the abstract already notes that peers produce distinct error types, any accuracy gain could arise from cumulative exposure or error coverage rather than from social mechanisms such as peer modeling or co-construction. No equating of total generated content across cells is described.

Authors: We acknowledge that the both-present condition involves greater total interaction volume by design. This configuration was chosen to test the combined presence of tutor and peer agents as they would occur in a realistic multi-party setting, consistent with social learning theory on peer modeling and perspective diversity. To address the potential confound, we have added a supplementary analysis that equates total LLM tokens and turns by subsampling the both-present interactions to match the single-agent arms; the accuracy advantage for the combined condition remains statistically reliable. We have also added explicit reporting of average tokens, turns, and error instances per cell in the revised methods section.

Revision: yes

Referee: [Results] Results reporting (both studies): the abstract and summary state directional outcomes (highest accuracy in the both-present condition; homogeneity avoided only in the two-agent condition) but supply no statistical tests, effect sizes, confidence intervals, or exclusion criteria. Without these, the load-bearing claims cannot be evaluated for reliability or practical significance.

Authors: We agree that the original submission insufficiently highlighted inferential statistics. The full manuscript contains the complete statistical reporting, including 2x2 ANOVA results with interaction effects, post-hoc comparisons, effect sizes, and 95% confidence intervals for both studies, as well as participant exclusion criteria based on attention checks and completion time. We have revised the abstract to include the key statistical outcomes and added a summary table of all inferential tests to the main text for clarity.

Revision: yes

Referee: [Divergent composition study methods] Divergent study design: the single-LLM vs. two-LLM contrast likewise does not equate total generated tokens or interaction volume. The homogeneity finding is therefore underdetermined with respect to whether it stems from model diversity or simply from receiving two independent generations.

Authors: We recognize that receiving two generations could contribute to reduced homogeneity independent of model differences. In the revised manuscript we now report average token counts per condition and include an additional control analysis comparing the two-model condition against a single-model condition prompted to generate two independent responses. This analysis indicates that cross-model diversity contributes to the observed reduction in idea-level homogeneity beyond volume alone. We have expanded the limitations section to discuss this distinction.

Revision: partial

Circularity Check

0 steps flagged

Empirical study with no derivations or self-referential predictions


The paper reports two controlled human-subject experiments (N=315 convergent math problem-solving; N=247 divergent essay composition) that compare learning outcomes across conditions varying the presence of LLM tutor and/or LLM peers. No equations, fitted parameters, uniqueness theorems, or derivation chains appear in the reported work. Outcomes are measured directly via unassisted test accuracy and essay quality metrics; no step reduces a 'prediction' to a quantity defined by the authors' own modeling choices or prior self-citations. The central claims rest on experimental contrasts rather than any self-definitional or load-bearing self-citation structure. This is the expected finding for a purely empirical HCI study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical HCI study; no free parameters, no invented entities, and only standard statistical assumptions about random assignment and outcome measurement.


