Cross-Course Generalizability of SRL-Aligned Predictive Models Using Digital Learning Traces

Di Xu; Jakob Schwerter; Judith Bose; Loreen Sabel; Marko Schmellenkamp; Matthew L. Bernacki; Philipp Doebler; Thomas Zeume

arxiv: 2604.22812 · v1 · submitted 2026-04-14 · cs.CY · cs.LG · stat.AP

Jakob Schwerter, Loreen Sabel, Judith Bose, Matthew L. Bernacki, Di Xu, Marko Schmellenkamp, Thomas Zeume, Philipp Doebler

Pith reviewed 2026-05-10 14:02 UTC · model grok-4.3

keywords: self-regulated learning, digital learning traces, at-risk students, predictive models, generalizability, higher education, machine learning, STEM education

The pith

Predictive models using self-regulated learning digital traces identify at-risk students early within courses but lose accuracy and calibration across institutions with different base rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether machine learning models trained on weekly digital learning traces aligned with self-regulated learning can forecast which students will struggle in theoretical computer science courses. The authors use data from three courses across two universities to check how well predictions hold over time within a course and when moved to new settings. They show that indicators of time management, effort regulation, and sustained engagement serve as reliable predictors for early detection inside the original course. The work also demonstrates that while one model type performs best on familiar data, another transfers better, yet overall results decline sharply when the share of at-risk students differs between training and new contexts.

Core claim

Using multimodal digital-trace data from three undergraduate theoretical computer science courses at two universities, the study builds weekly SRL-aligned indicators and fits Elastic Net, Random Forest, and XGBoost models to predict at-risk status. Early prediction within courses proves feasible, with SRL behaviors such as time management and effort regulation emerging as key predictors. Random Forest reaches the highest in-sample accuracy, yet Elastic Net generalizes more robustly across courses and institutions. Out-of-sample accuracy and calibration fall when models move between institutions that have different base rates of at-risk students.

What carries the argument

Weekly self-regulated learning aligned digital-trace indicators extracted from learning management systems and used as features in Elastic Net, Random Forest, and XGBoost classifiers for at-risk prediction.

If this is right

Early identification of at-risk students is feasible inside individual courses using SRL-aligned traces.
Behaviors tied to time management, effort regulation, and engagement function as strong predictors across the tested models.
Elastic Net models maintain better performance when moved to new courses or institutions than Random Forest models.
Model accuracy and calibration decrease when the proportion of at-risk students differs between the training context and the new setting.
Predictive analytics in higher education should account for institutional context rather than assume direct transfer of models.
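
One way to see why differing base rates matter, offered here as an illustration rather than as the paper's own analysis: even if a classifier's sensitivity and specificity stayed fixed, the share of flagged students who are truly at risk (the positive predictive value) would shift with prevalence. The sketch below applies Bayes' rule; the function name and every number in it are hypothetical and not taken from the study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of flagged students who are truly at risk, via Bayes' rule."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    flagged = true_pos + false_pos
    if flagged == 0.0:
        raise ValueError("no students would be flagged under these inputs")
    return true_pos / flagged


# Hypothetical operating point and base rates (assumptions, not study values).
for rate in (0.5, 0.2):
    ppv = positive_predictive_value(0.8, 0.8, rate)
    print(f"base rate {rate:.2f}: PPV = {ppv:.2f}")
```

With these made-up inputs the same classifier is right about four in five flagged students at a 50% base rate but only about one in two at 20%, which is one mechanism by which a transferred model can look worse without any change in the underlying features.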

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Institutions may achieve better results by building or retraining models on their own local data instead of importing models from elsewhere.
Correcting for differences in base rates before applying a model could reduce the observed drop in performance.
The same trace-based approach might be tried in non-theory courses to test whether the generalizability limits remain similar.
Collecting richer or more frequent trace data could help stabilize predictions when moving models between settings.
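
The base-rate suggestion in this list can be sketched as a standard prior-shift adjustment of predicted probabilities. This is an editorial illustration and not a procedure reported in the paper. It assumes that only the at-risk prevalence differs between contexts while the behavior of students within each class stays the same, which is exactly the assumption the paper's results call into question. The function name and the example rates are hypothetical.

```python
def adjust_for_base_rate(p, train_rate, target_rate):
    """Rescale a predicted at-risk probability from the training prevalence
    to a target prevalence, assuming only the class prior has changed."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not (0.0 < train_rate < 1.0 and 0.0 < target_rate < 1.0):
        raise ValueError("base rates must lie strictly between 0 and 1")
    # Reweight each class by the ratio of target prior to training prior.
    pos = p * target_rate / train_rate
    neg = (1.0 - p) * (1.0 - target_rate) / (1.0 - train_rate)
    return pos / (pos + neg)


# Hypothetical rates: model trained where 30% are at risk, applied where 15% are.
raw_scores = [0.10, 0.30, 0.60, 0.90]
adjusted = [adjust_for_base_rate(p, 0.30, 0.15) for p in raw_scores]
print([round(a, 3) for a in adjusted])
```

A score equal to the training base rate maps to the target base rate, and scores of 0 or 1 are left unchanged. If trace features mean different things in the new course, this adjustment alone would not be expected to restore calibration.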

Load-bearing premise

The weekly digital-trace indicators capture self-regulated learning constructs in a consistent and comparable manner across courses and institutions despite differing base rates.

What would settle it

Testing the same models on data from additional institutions that share the same at-risk base rate as the training data and measuring whether accuracy and calibration stay stable.

Original abstract

STEM dropout rates remain high at universities, particularly in computer science programs with theory-intensive courses. Digital learning environments now capture rich behavioral data that could help identify struggling students early, yet the generalizability of data-driven prediction models across courses and institutions remains uncertain. Guided by self-regulated learning (SRL) theory, this study analyzed multimodal digital-trace data from three undergraduate theoretical computer science courses (N1 = 137, N2 = 104, N3 = 148) at two universities. Weekly SRL-aligned digital-trace indicators were modeled using Elastic Net, Random Forest, and XGBoost to evaluate predictive performance over time and across settings, and model calibration both within and across courses. Early prediction of at-risk students was feasible, with SRL-related behaviors such as time management, effort regulation, and sustained engagement emerging as key predictors. While Random Forest achieved the highest in-sample accuracy, Elastic Net generalized more robustly across contexts. Out-of-sample accuracy and calibration declined between institutions with different base rates, underscoring the contextual nature of predictive analytics in higher education. These findings suggest that digital learning traces enable early identification of at-risk students within courses, but generalizing predictive models beyond their original context requires caution, particularly if the at-risk rates differ between contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: a private letter to a colleague

The paper shows SRL-framed digital traces predict at-risk students inside courses but lose accuracy and calibration across institutions, with Elastic Net holding up better than Random Forest when base rates differ.

The core finding is that early-warning models built on weekly LMS traces tied to time management, effort regulation, and engagement can work within a theoretical CS course, yet they degrade when applied to another institution's course, mainly because the share of at-risk students is not the same. Random Forest fits the original data best, but Elastic Net transfers more reliably, and both show poorer calibration out of sample when base rates shift.

That pattern is the useful part of the work. It gives concrete numbers on how much performance drops and ties the drop to a measurable difference in prevalence rather than leaving it as a vague warning about context. The three courses (N=137, 104, 148) at two universities supply a real multi-context test, which is more than most single-course studies deliver. The week-by-week tracking and explicit calibration checks are also practical for anyone building retention tools.

The main limitation is that the features are labeled SRL-aligned without much visible evidence that the same trace captures the same construct once course structure, LMS implementation, or student population changes. If the indicators are not equivalent, then attributing the performance drop solely to base rates becomes harder to defend, and the claim that specific SRL behaviors are the key predictors rests on thinner ground. Sample sizes are modest for cross-validation and hyperparameter work, so the generalization results could move with different splits or tuning choices.

This is the kind of paper that belongs in a reading group for people who actually deploy early-warning systems in STEM departments. It is honest about the limits it finds and does not overclaim transportability. A serious editor should send it to referees rather than desk-reject it; the empirical question is live and the data are from real courses.

Referee Report

3 major / 2 minor

Summary. The manuscript analyzes multimodal digital-trace data from three undergraduate theoretical computer science courses (N=137, 104, 148) at two universities to build and evaluate Elastic Net, Random Forest, and XGBoost models for early prediction of at-risk students. Guided by self-regulated learning (SRL) theory, it constructs weekly SRL-aligned indicators, reports that behaviors such as time management, effort regulation, and sustained engagement are key predictors, finds that Random Forest achieves the highest in-sample accuracy while Elastic Net generalizes more robustly across contexts, and notes declines in out-of-sample accuracy and calibration when base rates differ between institutions.

Significance. If the central empirical patterns hold after addressing feature validity and methodological transparency, the work provides concrete evidence on the feasibility and limits of cross-course/institution transfer in learning analytics. It strengthens the case for SRL-informed feature engineering while documenting the practical impact of differing prevalence rates, which is directly relevant to deployment decisions in higher-education predictive systems.

major comments (3)

[Methods] Methods (feature construction): The claim that weekly digital-trace indicators reliably capture SRL constructs such as time management and effort regulation across courses with different designs and LMS implementations is load-bearing for the interpretation of feature importance and cross-context generalization, yet the manuscript provides no validation (e.g., correlation with established SRL scales or course-specific mapping) that the engineered features are equivalent proxies.
[Results] Results (cross-institution evaluation): The reported decline in out-of-sample accuracy and calibration is attributed to differing base rates, but without explicit reporting of the exact train/test splits, hyperparameter search procedure, and calibration metrics (e.g., Brier score or reliability diagrams) per course pair, it is impossible to distinguish base-rate effects from feature misalignment or overfitting.
[Results] Table X (model performance): The statement that Elastic Net generalizes more robustly than Random Forest requires the full set of per-course AUC, F1, and calibration values plus statistical tests for the difference; the abstract summary alone does not establish that the generalization advantage is robust rather than an artifact of the particular held-out institution.
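
For readers unfamiliar with the calibration metrics named in these comments, the following is a minimal sketch of a Brier score and the binned table behind a reliability diagram. It is a generic illustration in plain Python, not the paper's evaluation code; the function names, the bin count, and the toy inputs are assumptions.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def reliability_table(probs, outcomes, n_bins=5):
    """Return (mean predicted, observed rate, count) for each non-empty bin.

    Plotting observed rate against mean predicted gives a reliability diagram;
    points on the diagonal indicate good calibration.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for p, _ in members) / n
            observed = sum(y for _, y in members) / n
            rows.append((mean_pred, observed, n))
    return rows


# Toy predictions and outcomes (hypothetical, for illustration only).
toy_probs = [0.1, 0.2, 0.7, 0.8, 0.9, 0.3]
toy_outcomes = [0, 0, 1, 1, 0, 1]
print(brier_score(toy_probs, toy_outcomes))
print(reliability_table(toy_probs, toy_outcomes, n_bins=2))
```

Reporting both quantities per train/test course pair, as the comment requests, would let a reader see whether miscalibration is a uniform offset (consistent with a base-rate shift) or varies across the score range.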

minor comments (2)

[Abstract] The abstract and results sections should explicitly state the exact definition of 'at-risk' (e.g., final grade threshold) and the temporal window used for early prediction to allow replication.
[Introduction] Missing references to prior SRL digital-trace studies (e.g., work on LMS log-based SRL measurement) would help situate the feature-engineering choices.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate where revisions will be made to improve clarity, transparency, and interpretability.

Referee: [Methods] Methods (feature construction): The claim that weekly digital-trace indicators reliably capture SRL constructs such as time management and effort regulation across courses with different designs and LMS implementations is load-bearing for the interpretation of feature importance and cross-context generalization, yet the manuscript provides no validation (e.g., correlation with established SRL scales or course-specific mapping) that the engineered features are equivalent proxies.

Authors: We acknowledge that the manuscript relies on theoretical mapping rather than direct empirical validation of the digital-trace proxies against SRL scales. Feature definitions were derived from established SRL literature linking trace data (e.g., assignment submission timing for time management, login regularity for effort regulation) to constructs, with citations to prior learning-analytics work. In the revision we will add an explicit mapping table in the Methods section and expand the Limitations discussion to note the proxy nature of the indicators and the absence of questionnaire-based validation. We cannot retroactively collect SRL scale data, but the added mapping and caveats will strengthen the interpretive foundation. revision: partial

Referee: [Results] Results (cross-institution evaluation): The reported decline in out-of-sample accuracy and calibration is attributed to differing base rates, but without explicit reporting of the exact train/test splits, hyperparameter search procedure, and calibration metrics (e.g., Brier score or reliability diagrams) per course pair, it is impossible to distinguish base-rate effects from feature misalignment or overfitting.

Authors: We agree that additional methodological detail is required. The revised manuscript will include a new subsection specifying: (1) the precise train/test partitions for every cross-institution and cross-course pair, (2) the full hyperparameter search procedure (grid search ranges, cross-validation folds, and selection criterion), and (3) per-pair calibration results with Brier scores plus reliability diagrams. These additions will enable readers to isolate base-rate effects from other sources of performance change. revision: yes

Referee: [Results] Table X (model performance): The statement that Elastic Net generalizes more robustly than Random Forest requires the full set of per-course AUC, F1, and calibration values plus statistical tests for the difference; the abstract summary alone does not establish that the generalization advantage is robust rather than an artifact of the particular held-out institution.

Authors: We accept that the current presentation is insufficient. The revision will expand the results to report the complete matrix of per-course and per-pair metrics (AUC, F1, precision, recall, Brier score) for all three models. We will also add statistical comparisons (DeLong tests for AUC differences and appropriate paired tests for other metrics) between Elastic Net and Random Forest across held-out contexts to support the generalization claim with quantitative evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard empirical ML study with held-out evaluation

The paper collects multimodal digital-trace data from three courses, engineers weekly SRL-aligned indicators based on established theory, trains Elastic Net/RF/XGBoost models, and evaluates in-sample, out-of-sample, and cross-course performance plus calibration. No equations, derivations, or self-referential definitions appear; predictions are generated from fitted models on independent splits rather than being forced by construction. Feature validity is an empirical assumption (not a definitional loop), and cross-context drops are reported as observed outcomes rather than tautological results. This matches the default non-circular case for applied predictive modeling.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that logged digital interactions can be mapped to SRL constructs and that the three sampled courses provide a sufficient test of cross-context generalization.

free parameters (2)

Elastic Net regularization strength and mixing parameter
Standard hyperparameters tuned during model fitting to balance sparsity and performance.
Random Forest and XGBoost hyperparameters (trees, depth, learning rate)
Chosen or optimized to maximize in-sample accuracy on the training data.
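
For orientation, the two Elastic Net parameters named in this list are commonly written as a strength λ and a mixing weight α in a penalized logistic loss. The form below follows the widely used glmnet convention; whether the paper uses exactly this parameterization is an assumption, and the symbols are not taken from the manuscript.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1-y_i)\log(1-p_i) \Bigr]
  + \lambda\Bigl[\, \alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \Bigr],
\qquad
p_i = \frac{1}{1+\exp\bigl(-(\beta_0 + x_i^{\top}\beta)\bigr)}
```

Under this convention α = 1 gives the lasso and α = 0 gives ridge regression, while larger λ shrinks coefficients more strongly. That shrinkage is a plausible, though here unverified, reason a linear model might transfer more stably than a tree ensemble tuned for in-sample fit.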

axioms (1)

Domain assumption: digital learning traces accurately reflect self-regulated learning behaviors such as time management and effort regulation.
Invoked when constructing weekly SRL-aligned indicators from raw logs.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

W., & Vergouwe, Y

https://doi.org/10.1111/bjet.13015 De Cock, B., Nieboer, D., Van Calster, B., Steyerberg, E. W., & Vergouwe, Y. (2023). The CalibrationCurves package: Assessing the agreement between observed outcomes and predictions [R package version 2.0.3]. https://doi.org/10.32614/CRAN.package.CalibrationCurves De Cock Campo, B. (2025). Introduction to the Calibratio...

work page doi:10.1111/bjet.13015 2023
[2]

For XGBoost, the maximum tree depth was varied over 3, 4, ..., 10 and the minimum child weight was between 1 and 10

The number of trees was set to one of the following values: 500, 1000, or 2000. For XGBoost, the maximum tree depth was varied over 3, 4, ..., 10 and the minimum child weight was between 1 and 10. The subsample ratio was between 0.5 and 1 and the column sampling ratio was between 0.5 and 1. Each range was discretized into eight candidate values. Classification T...

work page 2000
