
arxiv: 2605.03237 · v1 · submitted 2026-05-05 · 💻 cs.SE · cs.CL

TeamUp: Semantic Project Matching and Team Formation for Learning at Scale

Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords: team formation, semantic matching, project-based learning, educational technology, embedding models, cognitive diversity, large-scale courses

The pith

Semantic embeddings from language models can match students to projects and form cognitively diverse teams more effectively than traditional methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pretrained language model embeddings, paired with a hybrid ranking approach, can allocate students to suitably challenging projects and assemble teams with complementary skills. This matters because manual or survey-based methods at scale often produce homogeneous teams and leave under-represented students with fewer opportunities. If the approach holds, large project-based courses could deliver personalized assignments quickly and at low cost while increasing the range of skills within each team. The evaluation uses generated student profiles and project descriptions to demonstrate measurable gains in match quality, difficulty alignment, and team diversity.

Core claim

TeamUp applies semantic embeddings to compute cosine similarity between student profiles and project descriptions, then ranks matches using a hybrid algorithm that adds pedagogical constraints for difficulty level, domain preferences, and demand balancing. Teams are formed by selecting members that maximize variance across embeddings to ensure skill complementarity. In a virtual experiment with 250 student profiles and 60 project descriptions, this produced a mean cosine similarity of 0.74 versus 0.43 for the baseline, placed 83 percent of students within one difficulty level versus 34 percent, created teams covering three or more technical areas in 82 percent of cases versus 41 percent, and returned recommendations with sub-second latency at an operational cost under $0.10 per student.
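To make the matching step concrete, here is a minimal Python sketch of cosine-similarity ranking. It assumes student and project embeddings have already been produced by some sentence-embedding model (the page does not name the one TeamUp uses) and substitutes toy 3-dimensional vectors for real embeddings; the function names are hypothetical, not TeamUp's code.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_projects(student: list[float], projects: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k projects most similar to the student, highest similarity first."""
    scored = [(name, cosine(student, emb)) for name, emb in projects.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy 3-d vectors standing in for real sentence embeddings.
student = [0.9, 0.1, 0.0]
projects = {
    "web-dashboard": [0.8, 0.2, 0.1],
    "ml-pipeline": [0.1, 0.9, 0.2],
    "embedded-iot": [0.0, 0.1, 0.9],
}
print(top_k_projects(student, projects, k=2))
```

The student vector points mostly along the first axis, so "web-dashboard" ranks first. In practice the vectors would come from an encoder such as Sentence-BERT [19] and have hundreds of dimensions; only the ranking logic is shown here.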

What carries the argument

Hybrid ranking algorithm that combines cosine similarity from semantic embeddings with pedagogical constraints and uses embedding variance to ensure skill complementarity in team formation.
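The page does not give the hybrid scoring formula (the referee's third minor comment asks for it), so the following is only one plausible reading: a weighted sum of cosine similarity and three constraint terms. The weights, the 1 to 5 difficulty scale, the 0.5-per-level penalty, and the capacity-based demand term are all assumptions chosen for illustration, as are the names.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    similarity: float  # cosine similarity to this student, precomputed
    difficulty: int    # hypothetical 1 (intro) to 5 (advanced) scale
    domain: str
    demand: int        # students already allocated
    capacity: int      # seats available

# Hypothetical weights; the source does not say how constraints are weighted.
WEIGHTS = {"sim": 0.6, "difficulty": 0.2, "domain": 0.1, "demand": 0.1}

def hybrid_score(p: Project, student_level: int, preferred_domains: set[str]) -> float:
    # Difficulty fit: 1.0 on an exact match, minus 0.5 per level of mismatch, floored at 0.
    difficulty_fit = max(0.0, 1.0 - 0.5 * abs(p.difficulty - student_level))
    # Domain preference: binary bonus if the project is in a preferred domain.
    domain_fit = 1.0 if p.domain in preferred_domains else 0.0
    # Demand balancing: the fuller the project, the lower the score.
    load = min(1.0, p.demand / p.capacity) if p.capacity > 0 else 1.0
    demand_fit = 1.0 - load
    return (WEIGHTS["sim"] * p.similarity
            + WEIGHTS["difficulty"] * difficulty_fit
            + WEIGHTS["domain"] * domain_fit
            + WEIGHTS["demand"] * demand_fit)

def rank(projects: list[Project], student_level: int, preferred_domains: set[str]) -> list[Project]:
    """Sort projects by hybrid score, best first."""
    return sorted(projects, key=lambda p: hybrid_score(p, student_level, preferred_domains), reverse=True)

projects = [
    Project("compiler", similarity=0.82, difficulty=5, domain="systems", demand=9, capacity=10),
    Project("web-app", similarity=0.74, difficulty=3, domain="web", demand=2, capacity=10),
    Project("data-viz", similarity=0.70, difficulty=2, domain="data", demand=5, capacity=10),
]
for p in rank(projects, student_level=3, preferred_domains={"web"}):
    print(p.name, round(hybrid_score(p, 3, {"web"}), 3))
```

With these toy numbers the web project ranks first even though the compiler project has the highest raw similarity, because the compiler project sits two levels above the student and is nearly full. That trade-off is exactly what the weights control, which is why their values matter for reproducibility.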

Load-bearing premise

Semantic embeddings from pretrained language models accurately capture and represent student skill levels and project requirements in a way that matches real cognitive fit and distributions.

What would settle it

A live deployment in an actual course that compares student learning outcomes, team performance ratings, and self-reported skill growth against a manual-allocation control group. Finding no meaningful difference, or worse results, would count against the claim.

Figures

Figures reproduced from arXiv: 2605.03237 by Aditya Joshi, Basem Suleiman, Dhruv Gulwani, Sonit Singh.

Figure 1: TeamUp landing page showcasing three core features: AI-powered skill analysis using NLP for resume parsing and skill extraction, team complementarity modeling for balanced team formation, and semantic matching for personalized project recommendations.
…experiment on dummy users demonstrating that the approach produces substantially better allocations than random assignment; and (3) design patterns for r…
Figure 2: Administrative project management interface show…
Figure 4: Comparison of TeamUp versus random allocation across key metrics on dummy user data.
5.3 What This Tells Us
These results confirm the algorithm does what it is designed to do: it produces better matches and more balanced teams than random allocation, and it runs fast enough for real-time use. However, dummy data cannot tell us how real students would experience the system, whether they would find the recom…
Original abstract

Project-based learning improves student engagement and learning outcomes, yet allocating students to appropriately challenging projects while forming cognitively diverse teams remains difficult at scale. Traditional allocation methods (manual spreadsheets, preference surveys) can't construct the cognitively diverse teams that that collaborate cognitively. This mismatch perpetuates equity issues: high-performing students self-select visible projects while under-represented students face reduced access to opportunity. We propose TeamUp, a lightweight, embedding-based team-forming system designed to improve learning outcomes and equity in large-scale project-based courses. TeamUp uses semantic embeddings from pretrained language models to match students to projects aligned with their skill level. The system employs a hybrid ranking algorithm combining cosine similarity with pedagogical constraints (difficulty alignment, domain preferences, and demand balancing) to generate personalised and transparent recommendations. Beyond individual matching, TeamUp constructs cognitively diverse teams by modelling skill complementarity through embedding variance, ensuring teams possess well-distributed capabilities rather than homogeneous strengths. We evaluated TeamUp through a virtual experiment using 250 student profiles and 60 project descriptions. Results show: (1) substantially higher match quality (mean cosine similarity of 0.74 vs. 0.43); (2) better difficulty alignment (83% placed within one level vs. 34%); (3) more diverse teams (82% covering three or more technical areas vs. 41%); and (4) sub-second recommendation latency at operational costs under $0.10 per student.
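The abstract says teams are built by modelling skill complementarity through embedding variance, but not how members are chosen. Below is a minimal greedy sketch, assuming "variance" means the per-dimension population variance summed across dimensions and that one team is grown from a given seed student: at each step it adds the candidate whose embedding most increases that spread. The greedy strategy, the seed, the toy vectors, and the names are hypothetical.

```python
from statistics import pvariance

def team_spread(members: list[list[float]]) -> float:
    """Sum over embedding dimensions of the population variance across members."""
    if len(members) < 2:
        return 0.0
    return sum(pvariance(dim) for dim in zip(*members))

def form_team(pool: dict[str, list[float]], seed: str, size: int) -> list[str]:
    """Greedily grow a team from `seed`, each time adding the student who most increases spread."""
    team = [seed]
    remaining = [name for name in pool if name != seed]
    while len(team) < size and remaining:
        current = [pool[m] for m in team]
        best = max(remaining, key=lambda name: team_spread(current + [pool[name]]))
        team.append(best)
        remaining.remove(best)
    return team

# Toy embeddings: axis 0 ~ web, axis 1 ~ ML, axis 2 ~ embedded systems (illustrative only).
pool = {
    "ana": [1.0, 0.0, 0.0],
    "ben": [0.9, 0.1, 0.0],
    "chen": [0.0, 1.0, 0.0],
    "dev": [0.0, 0.0, 0.8],
}
print(form_team(pool, seed="ana", size=3))
```

A full allocator would repeat this over the whole cohort and respect project assignments; the sketch only shows the objective. Maximizing variance rewards dissimilar profiles, which is a proxy for complementarity rather than a guarantee of it.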

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes TeamUp, an embedding-based system for matching students to projects and forming cognitively diverse teams in large-scale project-based courses. It uses pretrained language model embeddings to compute cosine similarity for skill alignment, combined with a hybrid ranking algorithm that incorporates pedagogical constraints such as difficulty alignment, domain preferences, and demand balancing. Teams are formed by maximizing embedding variance to promote skill complementarity. Evaluation is performed exclusively via a virtual experiment on 250 synthetically generated student profiles and 60 project descriptions, claiming improvements over an implied baseline in match quality (mean cosine similarity 0.74 vs. 0.43), difficulty alignment (83% vs. 34% within one level), team diversity (82% vs. 41% covering three or more areas), and operational efficiency (sub-second latency, <$0.10 per student).

Significance. If the core assumptions hold, TeamUp offers a lightweight, scalable approach to a persistent challenge in educational technology: equitable and cognitively effective team allocation at scale. The hybrid algorithm and use of embedding variance for diversity are practical strengths that could integrate into existing learning management systems with low cost. The work highlights equity issues in self-selection but its impact is constrained by the synthetic nature of the evaluation.

major comments (3)
  1. Evaluation section (virtual experiment): All headline results (cosine similarity 0.74 vs. 0.43, 83% vs. 34% difficulty alignment, 82% vs. 41% diversity) are obtained solely from 250 generated student profiles and 60 generated project descriptions. No real student data, self-reported skills, human ratings of cognitive fit, or measured learning outcomes are provided, leaving the claims that TeamUp improves learning outcomes and equity dependent on unverified assumptions about data realism and embedding validity as proxies.
  2. Results presentation: The reported metrics are given as point estimates without statistical tests, standard deviations, confidence intervals, or error bars. This makes it impossible to assess whether the observed differences (e.g., 0.74 vs. 0.43) are robust or could arise from the synthetic generation process itself.
  3. Method and assumptions: The central modeling choice—that pretrained LM embeddings accurately represent student skill levels, project requirements, and cognitive complementarity—is load-bearing for all quantitative claims but receives no validation (e.g., no correlation with human judgments, no ablation across embedding models, no sensitivity analysis on synthetic profile generation parameters).
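For comment 2, because TeamUp and the baseline can be run on the same synthetic profiles, a paired comparison of per-student scores is the natural analysis (the simulated rebuttal proposes paired t-tests). The sketch below uses only the standard library and reports the mean difference, its standard deviation, the paired t statistic, and a normal-approximation 95% confidence interval; the function name is hypothetical and the input values are toy numbers, not results from the paper.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_summary(treatment: list[float], baseline: list[float], level: float = 0.95) -> dict:
    """Paired comparison of per-student scores under two allocation methods."""
    if len(treatment) != len(baseline) or len(treatment) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [t - b for t, b in zip(treatment, baseline)]
    n = len(diffs)
    d_mean = mean(diffs)
    d_sd = stdev(diffs)            # sample standard deviation (n - 1 denominator)
    se = d_sd / math.sqrt(n)
    t_stat = d_mean / se if se > 0 else float("nan")  # paired t statistic, df = n - 1
    # Normal approximation to the t critical value; close enough for n in the hundreds.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {"mean_diff": d_mean, "sd_diff": d_sd, "t": t_stat,
            "ci": (d_mean - z * se, d_mean + z * se)}

# Toy per-student match scores (not results from the paper).
teamup = [0.80, 0.70, 0.75, 0.72]
baseline = [0.40, 0.45, 0.50, 0.38]
print(paired_summary(teamup, baseline))
```

`scipy.stats.ttest_rel` computes the same t statistic with an exact p-value. Either way, the interval reflects variation across students only; re-running the synthetic generator with different seeds is needed to capture the generation-process variability the referee raises.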
minor comments (3)
  1. Abstract: Duplicate wording 'that that collaborate cognitively' should be corrected.
  2. Abstract and evaluation: The baseline method yielding 0.43/34%/41% is not described, preventing readers from understanding what the improvements are measured against.
  3. Method: Specify the exact pretrained language model(s) used, any preprocessing of profiles/descriptions, and the precise formulation of the hybrid ranking algorithm (including how constraints are weighted).

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments on our manuscript. We agree that the evaluation has limitations due to its synthetic nature and will revise the paper to better address these concerns by adding statistical analyses, sensitivity checks, and expanded discussion of assumptions and limitations. We respond to each major comment below.

Point-by-point responses
  1. Referee: Evaluation section (virtual experiment): All headline results (cosine similarity 0.74 vs. 0.43, 83% vs. 34% difficulty alignment, 82% vs. 41% diversity) are obtained solely from 250 generated student profiles and 60 generated project descriptions. No real student data, self-reported skills, human ratings of cognitive fit, or measured learning outcomes are provided, leaving the claims that TeamUp improves learning outcomes and equity dependent on unverified assumptions about data realism and embedding validity as proxies.

    Authors: We acknowledge the validity of this concern. The evaluation is indeed based solely on synthetic data, which was generated to simulate realistic student profiles and project descriptions based on typical computer science course structures. While this approach allows for controlled testing of the algorithm, it does limit the strength of claims about real-world learning outcomes and equity improvements. In the revised manuscript, we will add a dedicated subsection in the Evaluation and Limitations sections to discuss the assumptions underlying the synthetic data generation, the potential gaps in using embeddings as proxies for skills, and the need for future real-world studies with actual student data and human evaluations. We will also temper the claims in the abstract and conclusion to reflect this. revision: partial

  2. Referee: Results presentation: The reported metrics are given as point estimates without statistical tests, standard deviations, confidence intervals, or error bars. This makes it impossible to assess whether the observed differences (e.g., 0.74 vs. 0.43) are robust or could arise from the synthetic generation process itself.

    Authors: We agree that including measures of statistical significance and variability would strengthen the results. Since the data is synthetic, we can re-run the experiments with multiple generations or report variability. In the revision, we will include standard deviations for the metrics, compute confidence intervals, and perform appropriate statistical tests (such as paired t-tests) to show that the improvements are significant. We will also add error bars to relevant figures and discuss the robustness with respect to the synthetic generation process. revision: yes

  3. Referee: Method and assumptions: The central modeling choice—that pretrained LM embeddings accurately represent student skill levels, project requirements, and cognitive complementarity—is load-bearing for all quantitative claims but receives no validation (e.g., no correlation with human judgments, no ablation across embedding models, no sensitivity analysis on synthetic profile generation parameters).

    Authors: This is a fair point; the validity of the embeddings is a key assumption. We will revise the Method section to include an ablation study comparing at least two different embedding models (e.g., all-MiniLM-L6-v2 and a larger model like MPNet) to show consistency. We will also add a sensitivity analysis varying the parameters of the synthetic profile generator (e.g., skill distribution variance) and report how results change. Regarding human judgments, we note that correlating embeddings with human ratings would require additional data collection not available in this study, but we will cite relevant literature on the use of embeddings in educational matching and discuss this as a limitation. revision: partial
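The ablation promised in response 3 needs a way to say whether two embedding models agree. Raw cosine values are not directly comparable across models, since each model has its own similarity distribution, so one option is to compare rankings: the mean Jaccard overlap of each student's top-k projects. This sketch assumes each model's student-by-project similarities are precomputed as nested dictionaries; the function names and toy scores are hypothetical.

```python
def top_k(scores: dict[str, float], k: int) -> set[str]:
    """Names of the k highest-scoring projects."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def mean_topk_jaccard(sims_a: dict[str, dict[str, float]],
                      sims_b: dict[str, dict[str, float]], k: int = 3) -> float:
    """Average Jaccard overlap of each student's top-k projects under two embedding models."""
    if not sims_a or k < 1:
        raise ValueError("need at least one student and k >= 1")
    overlaps = []
    for student, scores_a in sims_a.items():
        a, b = top_k(scores_a, k), top_k(sims_b[student], k)
        overlaps.append(len(a & b) / len(a | b))
    return sum(overlaps) / len(overlaps)

# Toy cosine scores from two hypothetical models for the same students and projects.
sims_a = {"s1": {"p1": 0.90, "p2": 0.50, "p3": 0.10},
          "s2": {"p1": 0.20, "p2": 0.80, "p3": 0.70}}
sims_b = {"s1": {"p1": 0.85, "p2": 0.30, "p3": 0.60},
          "s2": {"p1": 0.30, "p2": 0.75, "p3": 0.70}}
print(mean_topk_jaccard(sims_a, sims_b, k=2))
```

A value near 1 means the two models recommend largely the same projects. The same function can be reused across a sweep of synthetic-generator settings for the sensitivity analysis.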

standing simulated objections not resolved
  • Providing real student data, self-reported skills, human ratings of cognitive fit, or measured learning outcomes, as these would necessitate a separate IRB-approved study with actual participants, which is outside the scope of the current virtual experiment.

Circularity Check

0 steps flagged

No significant circularity; evaluation metrics independent of system internals

full rationale

The paper describes an engineering system (embedding-based matching plus constraints, variance-based diversity) and evaluates it empirically on separately generated synthetic profiles. Reported metrics (cosine similarity, difficulty-level percentages, technical-area coverage counts) are computed after the fact using definitions that do not appear in the system's ranking equations or data-generation process. No derivation, uniqueness theorem, or fitted parameter is invoked; the comparison to baseline simply shows that the chosen algorithm scores higher on the chosen external metrics. The evaluation therefore remains self-contained and does not reduce to its own inputs by construction.
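The audit's point is that the evaluation metrics are defined outside the ranking logic. As an illustration, two of them can be written as plain functions over final assignments. This sketch assumes integer difficulty levels for students and projects and a set of technical-area tags per team member, neither of which the page specifies; the names and toy data are hypothetical.

```python
def within_one_level_rate(assigned_levels: list[int], student_levels: list[int]) -> float:
    """Share of students whose assigned project is within one difficulty level of their own."""
    if len(assigned_levels) != len(student_levels) or not student_levels:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(1 for a, s in zip(assigned_levels, student_levels) if abs(a - s) <= 1)
    return hits / len(student_levels)

def coverage_rate(teams: list[list[set[str]]], min_areas: int = 3) -> float:
    """Share of teams whose members jointly cover at least `min_areas` technical areas."""
    if not teams:
        raise ValueError("no teams")
    covered = sum(1 for team in teams if len(set().union(*team)) >= min_areas)
    return covered / len(teams)

# Toy data: each team member is represented by the set of areas they cover.
teams = [
    [{"web"}, {"ml", "data"}],  # 3 distinct areas
    [{"web"}, {"web", "ui"}],   # 2 distinct areas
]
print(within_one_level_rate([3, 5, 2, 1], [3, 3, 2, 4]))
print(coverage_rate(teams))
```

Because neither function reads the similarity scores or constraint weights, they can score TeamUp and a baseline on equal terms, which is the independence the audit relies on.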

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that pretrained language model embeddings serve as faithful proxies for skill levels and complementarity; no free parameters or new entities are explicitly introduced or fitted in the abstract.

axioms (1)
  • domain assumption: Pretrained language model embeddings capture semantic similarity between student skills and project descriptions sufficiently for matching and diversity measurement.
    Invoked as the foundation for both individual matching and team construction steps.

pith-pipeline@v0.9.0 · 5562 in / 1305 out tokens · 66713 ms · 2026-05-07T16:15:39.282981+00:00 · methodology


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

  1. [1]

    David J. Abraham, Robert W. Irving, and David F. Manlove. 2007. Two Algorithms for the Student-Project Allocation Problem. Journal of Discrete Algorithms 5, 1 (2007), 73–90

  2. [2]

    Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, New York, NY, USA, 1–13

  3. [3]

    Ryan S. Baker and Aaron Hawn. 2022. Algorithmic Bias in Education. International Journal of Artificial Intelligence in Education 32 (2022), 1052–1092

  4. [4]

    Albert Bandura. 1994. Self-Efficacy. In Encyclopedia of Human Behavior, V. S. Ramachandran (Ed.). Vol. 4. Academic Press, New York, 71–81

  5. [5]

    Phyllis C. Blumenfeld, Elliot Soloway, Ronald W. Marx, Joseph S. Krajcik, Mark Guzdial, and Annemarie Palincsar. 1991. Motivating Project-Based Learning: Sustaining the Doing, Supporting the Learning. Educational Psychologist 26, 3–4 (1991), 369–398

  6. [6]

    Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, Vol. 29

  7. [7]

    Edward L. Deci and Richard M. Ryan. 2000. The “What” and “Why” of Goal Pursuits: Human Needs and the Self-Determination of Behavior. Psychological Inquiry 11, 4 (2000), 227–268

  8. [8]

    Hendrik Drachsler, Katrien Verbert, Olga C. Santos, and Nikos Manouselis. 2015. Panorama of Recommender Systems to Support Learning. In Recommender Systems Handbook. Springer, Boston, MA, 421–451

  9. [9]

    Paul R. Harper, Vinícius de Senna, Igor T. Vieira, and Arjan K. Shahani. 2005. A Genetic Algorithm for the Project Assignment Problem. Computers & Operations Research 32, 5 (2005), 1255–1265

  10. [10]

    Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miro Dudik, and Hanna Wallach. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, New York, NY, USA, 1–16

  11. [11]

    Zhuoren Jiang, Yao Zhang, and Xing Li. 2019. Course Recommendation with Learner and Course Embeddings. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge. ACM, New York, NY, USA, 46–55

  12. [12]

    David W. Johnson and Roger T. Johnson. 1991. Cooperative Learning: Increasing College Faculty Instructional Productivity. ASHE-ERIC Higher Education Report, Vol. 4. School of Education and Human Development, George Washington University

  13. [13]

    Aleksandra Klašnja-Milićević, Boban Vesin, Mirjana Ivanović, and Zoran Budimac. 2011. E-Learning Personalization Based on Hybrid Recommendation Strategy and Learning Style Identification. Computers & Education 56, 3 (2011), 885–899

  15. [15]

    Theodoros Lappas, Kun Liu, and Evimaria Terzi. 2009. Finding a Team of Experts in Social Networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 467–476

  16. [16]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, Vol. 26

  17. [17]

    Julie E. Mills and David F. Treagust. 2003. Engineering Education—Is Problem-Based or Project-Based Learning the Answer? Australasian Journal of Engineering Education 3, 2 (2003), 2–16

  18. [18]

    pgvector contributors. 2023. pgvector: Open-Source Vector Similarity Search for Postgres. https://github.com/pgvector/pgvector. Accessed: 2025-12-15

  19. [19]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 3982–3992

  20. [20]

    Lev S. Vygotsky. 1978. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge, MA

  21. [21]

    Katherine Y. Williams and Charles A. O’Reilly III. 1998. Demography and Diversity in Organizations: A Review of 40 Years of Research. Research in Organizational Behavior 20 (1998), 77–140