TeamUp: Semantic Project Matching and Team Formation for Learning at Scale
Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3
The pith
Semantic embeddings from language models can match students to projects and form cognitively diverse teams more effectively than traditional methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TeamUp applies semantic embeddings to compute cosine similarity between student profiles and project descriptions, then ranks matches using a hybrid algorithm that adds pedagogical constraints for difficulty level, domain preferences, and demand balancing. Teams are formed by selecting members that maximize variance across embeddings to ensure skill complementarity. In a virtual experiment with 250 student profiles and 60 project descriptions, this produced a mean cosine similarity of 0.74 versus 0.43 for baselines, placed 83 percent of students within one difficulty level versus 34 percent, created teams covering three or more technical areas in 82 percent of cases versus 41 percent, and delivered recommendations with sub-second latency at an operational cost under $0.10 per student.
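To make the matching step concrete, here is a minimal sketch of a hybrid score of the kind described above. The paper does not publish its formula, so the linear form, the weights (`w_sim`, `w_diff`, `w_dom`, `w_load`), the dictionary fields, and the function names are all assumptions chosen for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_score(student, project, load,
                 w_sim=0.6, w_diff=0.2, w_dom=0.1, w_load=0.1):
    """Weighted mix of similarity and three constraint terms.

    All weights and field names are hypothetical; the source gives no values.
    """
    sim = cosine(student["emb"], project["emb"])
    # Difficulty alignment: 1.0 for an exact level match, minus 0.5 per level apart.
    diff = max(0.0, 1.0 - 0.5 * abs(student["level"] - project["level"]))
    # Domain preference: 1.0 if the project's domain is one the student listed.
    dom = 1.0 if project["domain"] in student["domains"] else 0.0
    # Demand balancing: share of the project's capacity that is still free.
    free = max(0.0, 1.0 - load.get(project["id"], 0) / project["capacity"])
    return w_sim * sim + w_diff * diff + w_dom * dom + w_load * free

def rank_projects(student, projects, load):
    """Return projects ordered from best to worst hybrid score."""
    return sorted(projects, key=lambda p: hybrid_score(student, p, load), reverse=True)

# Toy data, invented for this example only.
student = {"emb": [1.0, 0.0], "level": 2, "domains": {"ml"}}
projects = [
    {"id": "p2", "emb": [0.0, 1.0], "level": 4, "domain": "web", "capacity": 4},
    {"id": "p1", "emb": [1.0, 0.0], "level": 2, "domain": "ml", "capacity": 4},
]
ranking = rank_projects(student, projects, load={})
```

A linear mix keeps each constraint's contribution inspectable, which fits the stated goal of transparent recommendations, but other combinations (hard filters, multiplicative penalties) would be equally consistent with the description.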
What carries the argument
Hybrid ranking algorithm that combines cosine similarity from semantic embeddings with pedagogical constraints and uses embedding variance to ensure skill complementarity in team formation.
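The variance-based team step can be sketched in the same spirit. The greedy strategy, the choice of summed per-dimension variance as the diversity measure, and the names `total_variance` and `form_team` are assumptions; the source says only that teams maximize variance across embeddings.

```python
def total_variance(vectors):
    """Sum of per-dimension population variances for a list of equal-length vectors."""
    n = len(vectors)
    if n < 2:
        return 0.0
    total = 0.0
    for d in range(len(vectors[0])):
        mean = sum(v[d] for v in vectors) / n
        total += sum((v[d] - mean) ** 2 for v in vectors) / n
    return total

def form_team(candidates, size):
    """Greedily build a team of `size` ids from {id: embedding}.

    Starts from the alphabetically first id (an arbitrary, deterministic seed)
    and repeatedly adds whichever remaining candidate raises the team's
    total variance the most. Greedy selection is a heuristic, not an optimum.
    """
    if not candidates or size <= 0:
        return []
    remaining = dict(candidates)
    first = sorted(remaining)[0]
    team = [first]
    vecs = [remaining.pop(first)]
    while len(team) < size and remaining:
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda sid: total_variance(vecs + [remaining[sid]]))
        team.append(best)
        vecs.append(remaining.pop(best))
    return team

# Toy embeddings, invented for this example: "a" and "b" are near-duplicates.
candidates = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.0, 0.0, 1.0],
}
team = form_team(candidates, 3)
```

In this toy case the near-duplicate profile "b" is skipped in favour of "c" and "d", which is the behaviour the skill-complementarity argument relies on.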
Load-bearing premise
Semantic embeddings from pretrained language models accurately capture and represent student skill levels and project requirements in a way that matches real cognitive fit and distributions.
What would settle it
The claim would be undermined if a live deployment in an actual course, comparing student learning outcomes, team performance ratings, and self-reported skill growth against a manual-allocation control group, showed no meaningful difference or worse results.
Original abstract
Project-based learning improves student engagement and learning outcomes, yet allocating students to appropriately challenging projects while forming cognitively diverse teams remains difficult at scale. Traditional allocation methods (manual spreadsheets, preference surveys) can't construct the cognitively diverse teams that that collaborate cognitively. This mismatch perpetuates equity issues: high-performing students self-select visible projects while under-represented students face reduced access to opportunity. We propose TeamUp, a lightweight, embedding-based team-forming system designed to improve learning outcomes and equity in large-scale project-based courses. TeamUp uses semantic embeddings from pretrained language models to match students to projects aligned with their skill level. The system employs a hybrid ranking algorithm combining cosine similarity with pedagogical constraints (difficulty alignment, domain preferences, and demand balancing) to generate personalised and transparent recommendations. Beyond individual matching, TeamUp constructs cognitively diverse teams by modelling skill complementarity through embedding variance, ensuring teams possess well-distributed capabilities rather than homogeneous strengths. We evaluated TeamUp through a virtual experiment using 250 student profiles and 60 project descriptions. Results show: (1) substantially higher match quality (mean cosine similarity of 0.74 vs. 0.43); (2) better difficulty alignment (83% placed within one level vs. 34%); (3) more diverse teams (82% covering three or more technical areas vs. 41%); and (4) sub-second recommendation latency at operational costs under $0.10 per student.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TeamUp, an embedding-based system for matching students to projects and forming cognitively diverse teams in large-scale project-based courses. It uses pretrained language model embeddings to compute cosine similarity for skill alignment, combined with a hybrid ranking algorithm that incorporates pedagogical constraints such as difficulty alignment, domain preferences, and demand balancing. Teams are formed by maximizing embedding variance to promote skill complementarity. Evaluation is performed exclusively via a virtual experiment on 250 synthetically generated student profiles and 60 project descriptions, claiming improvements over an implied baseline in match quality (mean cosine similarity 0.74 vs. 0.43), difficulty alignment (83% vs. 34% within one level), team diversity (82% vs. 41% covering three or more areas), and operational efficiency (sub-second latency, <$0.10 per student).
Significance. If the core assumptions hold, TeamUp offers a lightweight, scalable approach to a persistent challenge in educational technology: equitable and cognitively effective team allocation at scale. The hybrid algorithm and use of embedding variance for diversity are practical strengths that could integrate into existing learning management systems with low cost. The work highlights equity issues in self-selection but its impact is constrained by the synthetic nature of the evaluation.
major comments (3)
- Evaluation section (virtual experiment): All headline results (cosine similarity 0.74 vs. 0.43, 83% vs. 34% difficulty alignment, 82% vs. 41% diversity) are obtained solely from 250 generated student profiles and 60 generated project descriptions. No real student data, self-reported skills, human ratings of cognitive fit, or measured learning outcomes are provided, leaving the claims that TeamUp improves learning outcomes and equity dependent on unverified assumptions about data realism and embedding validity as proxies.
- Results presentation: The reported metrics are given as point estimates without statistical tests, standard deviations, confidence intervals, or error bars. This makes it impossible to assess whether the observed differences (e.g., 0.74 vs. 0.43) are robust or could arise from the synthetic generation process itself.
- Method and assumptions: The central modeling choice—that pretrained LM embeddings accurately represent student skill levels, project requirements, and cognitive complementarity—is load-bearing for all quantitative claims but receives no validation (e.g., no correlation with human judgments, no ablation across embedding models, no sensitivity analysis on synthetic profile generation parameters).
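The second major comment asks for intervals rather than point estimates. One way this could be done is a percentile bootstrap on the difference in means, sketched below. The function name, the resampling scheme (each group resampled independently), and the sample scores are illustrative assumptions; the numbers are not taken from the paper.

```python
import random
import statistics

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b).

    Each group is resampled with replacement independently. If the same
    students were scored under both methods, resampling students (pairs)
    would be the more appropriate design.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical per-student match scores for two allocation methods.
system_scores = [0.70, 0.80, 0.75, 0.72]
baseline_scores = [0.40, 0.45, 0.50, 0.42]
lower, upper = bootstrap_diff_ci(system_scores, baseline_scores)
```

An interval that excludes zero would support the reported gap within a given synthetic dataset. It would not address whether the synthetic generator itself favours the system, which needs the re-generation and sensitivity checks the authors propose.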
minor comments (3)
- Abstract: Duplicate wording 'that that collaborate cognitively' should be corrected.
- Abstract and evaluation: The baseline method yielding 0.43/34%/41% is not described, preventing readers from understanding what the improvements are measured against.
- Method: Specify the exact pretrained language model(s) used, any preprocessing of profiles/descriptions, and the precise formulation of the hybrid ranking algorithm (including how constraints are weighted).
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We agree that the evaluation has limitations due to its synthetic nature and will revise the paper to better address these concerns by adding statistical analyses, sensitivity checks, and expanded discussion of assumptions and limitations. We respond to each major comment below.
Point-by-point responses
- Referee: Evaluation section (virtual experiment): All headline results (cosine similarity 0.74 vs. 0.43, 83% vs. 34% difficulty alignment, 82% vs. 41% diversity) are obtained solely from 250 generated student profiles and 60 generated project descriptions. No real student data, self-reported skills, human ratings of cognitive fit, or measured learning outcomes are provided, leaving the claims that TeamUp improves learning outcomes and equity dependent on unverified assumptions about data realism and embedding validity as proxies.
  Authors: We acknowledge the validity of this concern. The evaluation is indeed based solely on synthetic data, which was generated to simulate realistic student profiles and project descriptions based on typical computer science course structures. While this approach allows for controlled testing of the algorithm, it does limit the strength of claims about real-world learning outcomes and equity improvements. In the revised manuscript, we will add a dedicated subsection in the Evaluation and Limitations sections to discuss the assumptions underlying the synthetic data generation, the potential gaps in using embeddings as proxies for skills, and the need for future real-world studies with actual student data and human evaluations. We will also temper the claims in the abstract and conclusion to reflect this.
  Revision: partial
- Referee: Results presentation: The reported metrics are given as point estimates without statistical tests, standard deviations, confidence intervals, or error bars. This makes it impossible to assess whether the observed differences (e.g., 0.74 vs. 0.43) are robust or could arise from the synthetic generation process itself.
  Authors: We agree that including measures of statistical significance and variability would strengthen the results. Since the data is synthetic, we can re-run the experiments with multiple generations or report variability. In the revision, we will include standard deviations for the metrics, compute confidence intervals, and perform appropriate statistical tests (such as paired t-tests) to show that the improvements are significant. We will also add error bars to relevant figures and discuss the robustness with respect to the synthetic generation process.
  Revision: yes
- Referee: Method and assumptions: The central modeling choice (that pretrained LM embeddings accurately represent student skill levels, project requirements, and cognitive complementarity) is load-bearing for all quantitative claims but receives no validation (e.g., no correlation with human judgments, no ablation across embedding models, no sensitivity analysis on synthetic profile generation parameters).
  Authors: This is a fair point; the validity of the embeddings is a key assumption. We will revise the Method section to include an ablation study comparing at least two different embedding models (e.g., all-MiniLM-L6-v2 and a larger model like MPNet) to show consistency. We will also add a sensitivity analysis varying the parameters of the synthetic profile generator (e.g., skill distribution variance) and report how results change. Regarding human judgments, we note that correlating embeddings with human ratings would require additional data collection not available in this study, but we will cite relevant literature on the use of embeddings in educational matching and discuss this as a limitation.
  Revision: partial
Not addressed in this revision: providing real student data, self-reported skills, human ratings of cognitive fit, or measured learning outcomes. These would require a separate IRB-approved study with actual participants, which is outside the scope of the current virtual experiment.
Circularity Check
No significant circularity; evaluation metrics independent of system internals
The paper describes an engineering system (embedding-based matching plus constraints, variance-based diversity) and evaluates it empirically on separately generated synthetic profiles. Reported metrics (cosine similarity, difficulty-level percentages, technical-area coverage counts) are computed after the fact using definitions that do not appear in the system's ranking equations or data-generation process. No derivation, uniqueness theorem, or fitted parameter is invoked; the comparison to baseline simply shows that the chosen algorithm scores higher on the chosen external metrics. The evaluation therefore remains self-contained and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pretrained language model embeddings capture semantic similarity between student skills and project descriptions sufficiently for matching and diversity measurement.