Beyond the AI Tutor: Social Learning with LLM Agents
Pith reviewed 2026-05-13 19:14 UTC · model grok-4.3
The pith
Combining an LLM tutor with LLM peers improves unassisted learning outcomes beyond what a single tutor provides.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Participants who worked with both an LLM tutor and LLM peers reached the highest accuracy on unassisted SAT-style math problems, while in argumentative and creative writing tasks only the condition with two distinct LLMs avoided the reduction in idea-level variety produced by single-model assistance.
What carries the argument
Multi-agent LLM configurations that add peer agents making distinct conceptual or arithmetic errors alongside a tutor agent.
If this is right
- In convergent problem-solving tasks, adding LLM peers to a tutor produces the largest post-interaction accuracy lift.
- In divergent writing tasks, two-agent setups maintain broader idea distributions where single-agent setups do not.
- Design of AI learning tools can move from dyadic tutoring toward configurations that simulate observational and co-constructive benefits.
- Error diversity across agents appears to support the observed advantages in both domains.
Where Pith is reading between the lines
- Classroom-scale deployments might combine several specialized agents to approximate group discussion without increasing human teacher load.
- Future designs could test whether the same multi-agent pattern improves outcomes in domains such as coding or scientific reasoning.
- If the pattern holds, platforms may need new interfaces that let learners choose which agents to consult rather than defaulting to a single model.
Load-bearing premise
The measured gains come specifically from multi-party social-learning processes rather than from a larger volume of AI output, from the particular error patterns shown, or from laboratory demand effects.
What would settle it
A replication that matches total AI exposure time across conditions but removes the peer-interaction element and still finds equivalent gains would falsify the claim that multi-party mechanisms are responsible.
Original abstract
Most AI-based educational tools today adopt a one-on-one tutoring paradigm, pairing a single LLM with a single learner. Yet decades of learning science research suggest that multi-party interaction -- through peer modeling, co-construction, and exposure to diverse perspectives -- can produce learning benefits that dyadic tutoring alone cannot. In this paper, we investigate whether multi-agent LLM configurations can enhance learning outcomes beyond what a single LLM tutor provides. We present two controlled experiments spanning distinct learning contexts. In a convergent problem-solving study (N=315), participants tackle SAT-level math problems in a 2x2 design that varies the presence of an LLM tutor and LLM peers, each making different kinds of errors (conceptual vs. arithmetic); participants who interacted with both a tutor and peers achieved the highest unassisted test accuracy. In a divergent composition study (N=247), participants write argumentative and creative essays with either no AI assistance, a single LLM (Claude or ChatGPT), or both Claude and ChatGPT together; while both LLM conditions improved essay quality, only the two-agent condition avoided the idea-level homogeneity that single-model assistance was found to produce. Together, these studies offer one of the first controlled investigations of multi-agent LLM learning environments, probing whether the move from one-on-one AI tutoring toward richer agent configurations can unlock the collaborative and observational benefits long documented in human social learning research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multi-agent LLM configurations can enhance learning beyond single-tutor setups. In a convergent 2x2 experiment (N=315) on SAT math problems, participants with both an LLM tutor and LLM peers (making distinct conceptual vs. arithmetic errors) achieved the highest unassisted test accuracy. In a divergent study (N=247) on argumentative and creative essays, both single- and dual-LLM conditions improved quality over no assistance, but only the two-agent condition (Claude + ChatGPT) avoided the idea-level homogeneity observed with single models.
Significance. If the central claims survive controls for exposure volume and proper statistical reporting, the work supplies one of the first controlled empirical tests of multi-party LLM learning environments. It directly links decades of social-learning research (peer modeling, co-construction, perspective diversity) to concrete agent configurations, offering a falsifiable path from dyadic tutoring to richer multi-agent setups.
major comments (3)
- [Convergent problem-solving study methods] Convergent study methods (2x2 design): the both-present arm necessarily supplies more total LLM outputs, dialogue turns, and error instances than the single-factor arms. Because the abstract already notes that peers produce distinct error types, any accuracy gain could arise from cumulative exposure or error coverage rather than from social mechanisms such as peer modeling or co-construction. No equating of total generated content across cells is described.
- [Results] Results reporting (both studies): the abstract and summary state directional outcomes (highest accuracy in the both-present condition; homogeneity avoided only in the two-agent condition) but supply no statistical tests, effect sizes, confidence intervals, or exclusion criteria. Without these, the load-bearing claims cannot be evaluated for reliability or practical significance.
- [Divergent composition study methods] Divergent study design: the single-LLM vs. two-LLM contrast likewise does not equate total generated tokens or interaction volume. The homogeneity finding is therefore underdetermined with respect to whether it stems from model diversity or simply from receiving two independent generations.
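One way to make the idea-level homogeneity claim checkable is to score each condition by how much its essays overlap in ideas. The sketch below assumes each essay has already been coded into a set of idea labels and uses mean pairwise Jaccard similarity. The function names, the coding scheme, and the toy labels are illustrative assumptions, not the paper's metric or data.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two idea sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_similarity(essays: list[set[str]]) -> float:
    """Average Jaccard similarity over all essay pairs in one condition.

    Higher values mean the essays in that condition share more ideas,
    i.e. the condition is more homogeneous.
    """
    pairs = list(combinations(essays, 2))
    if not pairs:
        raise ValueError("need at least two essays")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy, made-up idea codes (hypothetical; not taken from the study).
single_model = [{"cost", "safety"}, {"cost", "safety"}, {"cost", "jobs"}]
two_model = [{"cost", "safety"}, {"equity", "jobs"}, {"privacy", "cost"}]

print(mean_pairwise_similarity(single_model))
print(mean_pairwise_similarity(two_model))
```

Scoring a volume-matched single-model condition the same way would help separate model diversity from the effect of simply seeing two generations.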
minor comments (2)
- [Methods] Clarify whether multi-party interactions are synchronous (real-time multi-agent chat) or sequential single-agent turns; this distinction is central to the social-learning interpretation.
- [Methods] Add explicit power analysis or justification for the chosen sample sizes (N=315, N=247) given the 2x2 and three-arm designs.
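A rough version of the requested power check can be sketched with a normal approximation for a two-group comparison of means. The sketch assumes equal allocation across cells (so roughly 315 // 4 participants per cell in the 2x2 study and 247 // 3 per arm in the three-arm study), a two-sided test, and hypothetical standardized effect sizes. None of these values come from the paper.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-sided, two-sample comparison.

    d is the assumed standardized mean difference (Cohen's d) and
    n_per_group is the assumed number of participants in each group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region under the alternative.
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)


# Assumed per-cell sizes under equal allocation (an assumption, not reported here).
n_convergent = 315 // 4
n_divergent = 247 // 3

for d in (0.3, 0.5):  # hypothetical effect sizes
    print(d, approx_power(d, n_convergent), approx_power(d, n_divergent))
```

A pairwise contrast is the simplest case. Power for the 2x2 interaction term is generally lower than for a pairwise contrast of the same raw size, so this sketch gives an optimistic bound rather than a substitute for a full analysis.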
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify the interpretability of our findings on multi-agent LLM learning environments. We address each major comment below and indicate revisions incorporated into the updated manuscript.
- Referee: [Convergent problem-solving study methods] Convergent study methods (2x2 design): the both-present arm necessarily supplies more total LLM outputs, dialogue turns, and error instances than the single-factor arms. Because the abstract already notes that peers produce distinct error types, any accuracy gain could arise from cumulative exposure or error coverage rather than from social mechanisms such as peer modeling or co-construction. No equating of total generated content across cells is described.
  Authors: We acknowledge that the both-present condition involves greater total interaction volume by design. This configuration was chosen to test the combined presence of tutor and peer agents as they would occur in a realistic multi-party setting, consistent with social learning theory on peer modeling and perspective diversity. To address the potential confound, we have added a supplementary analysis that equates total LLM tokens and turns by subsampling the both-present interactions to match the single-agent arms; the accuracy advantage for the combined condition remains statistically reliable. We have also added explicit reporting of average tokens, turns, and error instances per cell in the revised methods section.
  Revision: yes
- Referee: [Results] Results reporting (both studies): the abstract and summary state directional outcomes (highest accuracy in the both-present condition; homogeneity avoided only in the two-agent condition) but supply no statistical tests, effect sizes, confidence intervals, or exclusion criteria. Without these, the load-bearing claims cannot be evaluated for reliability or practical significance.
  Authors: We agree that the original submission insufficiently highlighted inferential statistics. The full manuscript contains the complete statistical reporting, including 2x2 ANOVA results with interaction effects, post-hoc comparisons, effect sizes, and 95% confidence intervals for both studies, as well as participant exclusion criteria based on attention checks and completion time. We have revised the abstract to include the key statistical outcomes and added a summary table of all inferential tests to the main text for clarity.
  Revision: yes
- Referee: [Divergent composition study methods] Divergent study design: the single-LLM vs. two-LLM contrast likewise does not equate total generated tokens or interaction volume. The homogeneity finding is therefore underdetermined with respect to whether it stems from model diversity or simply from receiving two independent generations.
  Authors: We recognize that receiving two generations could contribute to reduced homogeneity independent of model differences. In the revised manuscript we now report average token counts per condition and include an additional control analysis comparing the two-model condition against a single-model condition prompted to generate two independent responses. This analysis indicates that cross-model diversity contributes to the observed reduction in idea-level homogeneity beyond volume alone. We have expanded the limitations section to discuss this distinction.
  Revision: partial
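The token-matching step mentioned in the first response can be pictured with a small sketch. It assumes a session is represented as a list of per-turn token counts and that matching means randomly keeping turns until a token budget taken from the comparison arm would be exceeded. The function name, the greedy fill rule, and the numbers are hypothetical and do not describe the authors' actual procedure.

```python
import random


def subsample_to_budget(turn_tokens: list[int], budget: int, seed: int = 0) -> list[int]:
    """Return sorted indices of a random subset of turns within a token budget.

    Turns are visited in a seeded random order and kept whenever they still
    fit, so the kept total never exceeds `budget`.
    """
    rng = random.Random(seed)
    order = list(range(len(turn_tokens)))
    rng.shuffle(order)
    kept: list[int] = []
    total = 0
    for i in order:
        if total + turn_tokens[i] <= budget:
            kept.append(i)
            total += turn_tokens[i]
    return sorted(kept)


# Hypothetical both-present session and a budget taken from a single-agent arm.
turns = [120, 80, 200, 60, 150]
kept = subsample_to_budget(turns, budget=300)
print(kept, sum(turns[i] for i in kept))
```

The fixed seed keeps the subsample reproducible; repeating the analysis over many seeds would show how sensitive the result is to which turns are dropped.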
Circularity Check
Empirical study with no derivations or self-referential predictions
The paper reports two controlled human-subject experiments (N=315 convergent math problem-solving; N=247 divergent essay composition) that compare learning outcomes across conditions varying the presence of LLM tutor and/or LLM peers. No equations, fitted parameters, uniqueness theorems, or derivation chains appear in the reported work. Outcomes are measured directly via unassisted test accuracy and essay quality metrics; no step reduces a 'prediction' to a quantity defined by the authors' own modeling choices or prior self-citations. The central claims rest on experimental contrasts rather than any self-definitional or load-bearing self-citation structure. This is the expected finding for a purely empirical HCI study.
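For readers who want to see what an experimental contrast of this kind looks like in the 2x2 study, the sketch below computes the tutor-by-peers interaction as a difference of differences in cell means, with a Wald-style confidence interval. It assumes per-participant accuracy scores are available for each cell. The function name and the toy scores are made up for illustration and are not the paper's data or its reported analysis (the rebuttal mentions a 2x2 ANOVA).

```python
from math import sqrt
from statistics import NormalDist, mean, variance

Cell = tuple[bool, bool]  # (tutor_present, peers_present)


def interaction_contrast(cells: dict[Cell, list[float]], level: float = 0.95):
    """Difference-of-differences estimate and normal-approximation CI.

    Estimate = (both - tutor_only) - (peers_only - neither).
    """
    weights = {(True, True): 1, (True, False): -1, (False, True): -1, (False, False): 1}
    estimate = sum(w * mean(cells[k]) for k, w in weights.items())
    # Cells are independent groups, so their sampling variances add.
    se = sqrt(sum(variance(cells[k]) / len(cells[k]) for k in weights))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, (estimate - z * se, estimate + z * se)


# Toy, made-up accuracy scores (hypothetical; three participants per cell).
toy = {
    (False, False): [0.4, 0.5, 0.6],
    (True, False): [0.5, 0.6, 0.7],
    (False, True): [0.5, 0.6, 0.7],
    (True, True): [0.8, 0.9, 1.0],
}
print(interaction_contrast(toy))
```

An interval that excludes zero would indicate that the combined condition adds more than the sum of the separate tutor and peer effects, which is a stronger statement than the both-present cell simply having the highest mean.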