Cross-Course Generalizability of SRL-Aligned Predictive Models Using Digital Learning Traces
Pith reviewed 2026-05-10 14:02 UTC · model grok-4.3
The pith
Predictive models using self-regulated learning digital traces identify at-risk students early within courses but lose accuracy and calibration across institutions with different base rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using multimodal digital-trace data from three undergraduate theoretical computer science courses at two universities, the study builds weekly SRL-aligned indicators and fits Elastic Net, Random Forest, and XGBoost models to predict at-risk status. Early prediction within courses proves feasible, with SRL behaviors such as time management and effort regulation emerging as key predictors. Random Forest reaches the highest in-sample accuracy, yet Elastic Net generalizes more robustly across courses and institutions. Out-of-sample accuracy and calibration fall when models are transferred between institutions that have different base rates of at-risk students.
What carries the argument
Weekly self-regulated learning aligned digital-trace indicators extracted from learning management systems and used as features in Elastic Net, Random Forest, and XGBoost classifiers for at-risk prediction.
If this is right
- Early identification of at-risk students is feasible inside individual courses using SRL-aligned traces.
- Behaviors tied to time management, effort regulation, and engagement function as strong predictors across the tested models.
- Elastic Net models maintain better performance when moved to new courses or institutions than Random Forest models.
- Model accuracy and calibration decrease when the proportion of at-risk students differs between the training context and the new setting.
- Predictive analytics in higher education should account for institutional context rather than assume direct transfer of models.
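The base-rate point in the list above can be illustrated with simple arithmetic. The sketch below assumes a classifier whose sensitivity and specificity stay fixed across settings, which the paper does not claim; the operating point and the two base rates are hypothetical numbers chosen only for illustration. Under that assumption, accuracy and positive predictive value (PPV) still change when the share of at-risk students changes.

```python
def accuracy_and_ppv(sensitivity, specificity, base_rate):
    """Return (accuracy, PPV) for a classifier with fixed class-conditional rates."""
    true_pos = sensitivity * base_rate
    false_pos = (1.0 - specificity) * (1.0 - base_rate)
    true_neg = specificity * (1.0 - base_rate)
    accuracy = true_pos + true_neg
    ppv = true_pos / (true_pos + false_pos)
    return accuracy, ppv


# Hypothetical operating point and base rates, not values from the paper.
for rate in (0.40, 0.20):
    acc, ppv = accuracy_and_ppv(sensitivity=0.80, specificity=0.70, base_rate=rate)
    print(f"base rate {rate:.2f}: accuracy {acc:.3f}, PPV {ppv:.3f}")
```

With these illustrative numbers, halving the base rate lowers PPV considerably even though the classifier behaves identically within each class, so some performance change across institutions may be expected from prevalence alone.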
Where Pith is reading between the lines
- Institutions may achieve better results by building or retraining models on their own local data instead of importing models from elsewhere.
- Correcting for differences in base rates before applying a model could reduce the observed drop in performance.
- The same trace-based approach might be tried in non-theory courses to test whether the generalizability limits remain similar.
- Collecting richer or more frequent trace data could help stabilize predictions when moving models between settings.
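One common way to correct for a base-rate difference, as suggested in the list above, is to rescale the predicted odds by the ratio of target to training prior odds. The sketch below is not the paper's method. It assumes that only the prevalence differs between settings (the feature distributions within each class stay the same), and the function name and the example rates are hypothetical.

```python
def adjust_for_base_rate(prob, train_rate, target_rate):
    """Rescale a predicted probability from the training base rate to a target one.

    Assumes pure prior shift: only the share of at-risk students differs.
    """
    pos = prob * (target_rate / train_rate)
    neg = (1.0 - prob) * ((1.0 - target_rate) / (1.0 - train_rate))
    return pos / (pos + neg)


# Hypothetical rates: trained where 40% are at risk, applied where 20% are.
print(adjust_for_base_rate(0.5, train_rate=0.40, target_rate=0.20))
```

If the engineered features themselves behave differently across courses, this correction addresses only part of the performance drop.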
Load-bearing premise
The weekly digital-trace indicators capture self-regulated learning constructs in a consistent and comparable manner across courses and institutions despite differing base rates.
What would settle it
Testing the same models on data from additional institutions that share the same at-risk base rate as the training data and measuring whether accuracy and calibration stay stable.
Original abstract
STEM dropout rates remain high at universities, particularly in computer science programs with theory-intensive courses. Digital learning environments now capture rich behavioral data that could help identify struggling students early, yet the generalizability of data-driven prediction models across courses and institutions remains uncertain. Guided by self-regulated learning (SRL) theory, this study analyzed multimodal digital-trace data from three undergraduate theoretical computer science courses (N1 = 137, N2 = 104, N3 = 148) at two universities. Weekly SRL-aligned digital-trace indicators were modeled using Elastic Net, Random Forest, and XGBoost to evaluate predictive performance over time and across settings, and model calibration both within and across courses. Early prediction of at-risk students was feasible, with SRL-related behaviors such as time management, effort regulation, and sustained engagement emerging as key predictors. While Random Forest achieved the highest in-sample accuracy, Elastic Net generalized more robustly across contexts. Out-of-sample accuracy and calibration declined between institutions with different base rates, underscoring the contextual nature of predictive analytics in higher education. These findings suggest that digital learning traces enable early identification of at-risk students within courses, but generalizing predictive models beyond their original context requires caution, particularly if the at-risk rates differ between contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes multimodal digital-trace data from three undergraduate theoretical computer science courses (N=137, 104, 148) at two universities to build and evaluate Elastic Net, Random Forest, and XGBoost models for early prediction of at-risk students. Guided by self-regulated learning (SRL) theory, it constructs weekly SRL-aligned indicators and reports that behaviors such as time management, effort regulation, and sustained engagement are key predictors. It finds that Random Forest achieves the highest in-sample accuracy while Elastic Net generalizes more robustly across contexts, and it notes declines in out-of-sample accuracy and calibration when base rates differ between institutions.
Significance. If the central empirical patterns hold after addressing feature validity and methodological transparency, the work provides concrete evidence on the feasibility and limits of cross-course/institution transfer in learning analytics. It strengthens the case for SRL-informed feature engineering while documenting the practical impact of differing prevalence rates, which is directly relevant to deployment decisions in higher-education predictive systems.
major comments (3)
- [Methods] Methods (feature construction): The claim that weekly digital-trace indicators reliably capture SRL constructs such as time management and effort regulation across courses with different designs and LMS implementations is load-bearing for the interpretation of feature importance and cross-context generalization, yet the manuscript provides no validation (e.g., correlation with established SRL scales or course-specific mapping) that the engineered features are equivalent proxies.
- [Results] Results (cross-institution evaluation): The reported decline in out-of-sample accuracy and calibration is attributed to differing base rates, but without explicit reporting of the exact train/test splits, hyperparameter search procedure, and calibration metrics (e.g., Brier score or reliability diagrams) per course pair, it is impossible to distinguish base-rate effects from feature misalignment or overfitting.
- [Results] Table X (model performance): The statement that Elastic Net generalizes more robustly than Random Forest requires the full set of per-course AUC, F1, and calibration values plus statistical tests for the difference; the abstract summary alone does not establish that the generalization advantage is robust rather than an artifact of the particular held-out institution.
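The calibration metrics requested in the comments above can be sketched in a few lines. This is a minimal illustration, not the manuscript's evaluation code: the function names, the equal-width binning, and the toy labels and probabilities are all assumptions. The Brier score is the mean squared difference between predicted probability and outcome, and the reliability table lists mean predicted probability against observed at-risk rate per bin, which is the data behind a reliability diagram.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def reliability_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted, observed rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Toy data for illustration only.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]
print(brier_score(labels, probs))
print(reliability_table(labels, probs, n_bins=2))
```

Reporting both per course pair would show whether a transferred model systematically over- or under-predicts risk in the new setting.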
minor comments (2)
- [Abstract] The abstract and results sections should explicitly state the exact definition of 'at-risk' (e.g., final grade threshold) and the temporal window used for early prediction to allow replication.
- [Introduction] Missing references to prior SRL digital-trace studies (e.g., work on LMS log-based SRL measurement) would help situate the feature-engineering choices.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate where revisions will be made to improve clarity, transparency, and interpretability.
- Referee: [Methods] Methods (feature construction): The claim that weekly digital-trace indicators reliably capture SRL constructs such as time management and effort regulation across courses with different designs and LMS implementations is load-bearing for the interpretation of feature importance and cross-context generalization, yet the manuscript provides no validation (e.g., correlation with established SRL scales or course-specific mapping) that the engineered features are equivalent proxies.
Authors: We acknowledge that the manuscript relies on theoretical mapping rather than direct empirical validation of the digital-trace proxies against SRL scales. Feature definitions were derived from established SRL literature linking trace data (e.g., assignment submission timing for time management, login regularity for effort regulation) to constructs, with citations to prior learning-analytics work. In the revision we will add an explicit mapping table in the Methods section and expand the Limitations discussion to note the proxy nature of the indicators and the absence of questionnaire-based validation. We cannot retroactively collect SRL scale data, but the added mapping and caveats will strengthen the interpretive foundation. revision: partial
- Referee: [Results] Results (cross-institution evaluation): The reported decline in out-of-sample accuracy and calibration is attributed to differing base rates, but without explicit reporting of the exact train/test splits, hyperparameter search procedure, and calibration metrics (e.g., Brier score or reliability diagrams) per course pair, it is impossible to distinguish base-rate effects from feature misalignment or overfitting.
Authors: We agree that additional methodological detail is required. The revised manuscript will include a new subsection specifying: (1) the precise train/test partitions for every cross-institution and cross-course pair, (2) the full hyperparameter search procedure (grid search ranges, cross-validation folds, and selection criterion), and (3) per-pair calibration results with Brier scores plus reliability diagrams. These additions will enable readers to isolate base-rate effects from other sources of performance change. revision: yes
- Referee: [Results] Table X (model performance): The statement that Elastic Net generalizes more robustly than Random Forest requires the full set of per-course AUC, F1, and calibration values plus statistical tests for the difference; the abstract summary alone does not establish that the generalization advantage is robust rather than an artifact of the particular held-out institution.
Authors: We accept that the current presentation is insufficient. The revision will expand the results to report the complete matrix of per-course and per-pair metrics (AUC, F1, precision, recall, Brier score) for all three models. We will also add statistical comparisons (DeLong tests for AUC differences and appropriate paired tests for other metrics) between Elastic Net and Random Forest across held-out contexts to support the generalization claim with quantitative evidence. revision: yes
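For readers unfamiliar with the quantity that the proposed DeLong tests compare, the AUC can be read as the probability that a randomly chosen at-risk student receives a higher score than a randomly chosen student who is not at risk, with ties counted as one half. The sketch below computes only that statistic, not the DeLong test itself; the function name and the toy scores are hypothetical.

```python
def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data for illustration only.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Because AUC depends only on ranking, it can stay stable across institutions while calibration degrades, which is one reason the referee asks for both.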
Circularity Check
No significant circularity; standard empirical ML study with held-out evaluation
The paper collects multimodal digital-trace data from three courses, engineers weekly SRL-aligned indicators based on established theory, trains Elastic Net/RF/XGBoost models, and evaluates in-sample, out-of-sample, and cross-course performance plus calibration. No equations, derivations, or self-referential definitions appear; predictions are generated from fitted models on independent splits rather than being forced by construction. Feature validity is an empirical assumption (not a definitional loop), and cross-context drops are reported as observed outcomes rather than tautological results. This matches the default non-circular case for applied predictive modeling.
Axiom & Free-Parameter Ledger
free parameters (2)
- Elastic Net regularization strength and mixing parameter
- Random Forest and XGBoost hyperparameters (trees, depth, learning rate)
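To give a sense of how large the free-parameter space is, the sketch below enumerates an XGBoost search space from the ranges quoted in the excerpt under the reference graph (tree depth 3 to 10, minimum child weight 1 to 10, subsample and column sampling ratios 0.5 to 1, eight candidate values per range). The even spacing, the dictionary keys (which follow XGBoost's parameter naming), and the full Cartesian product are assumptions; the excerpt is truncated and does not say whether the search was exhaustive or sampled.

```python
from itertools import product


def evenly_spaced(low, high, count):
    """Return `count` evenly spaced values from low to high, inclusive."""
    return [low + (high - low) * i / (count - 1) for i in range(count)]


# Ranges taken from the quoted excerpt; even spacing is an assumption.
xgb_space = {
    "max_depth": list(range(3, 11)),  # 3, 4, ..., 10
    "min_child_weight": evenly_spaced(1.0, 10.0, 8),
    "subsample": evenly_spaced(0.5, 1.0, 8),
    "colsample_bytree": evenly_spaced(0.5, 1.0, 8),
}

names = list(xgb_space)
candidates = [dict(zip(names, combo)) for combo in product(*xgb_space.values())]
print(len(candidates))  # 8 ** 4 = 4096 combinations
```

With course sizes between 104 and 148 students, a space of this size makes the referee's request for the exact search procedure and selection criterion a practical concern for overfitting.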
axioms (1)
- domain assumption: Digital learning traces accurately reflect self-regulated learning behaviors such as time management and effort regulation
Reference graph
Works this paper leans on
- [1] https://doi.org/10.1111/bjet.13015 De Cock, B., Nieboer, D., Van Calster, B., Steyerberg, E. W., & Vergouwe, Y. (2023). The CalibrationCurves package: Assessing the agreement between observed outcomes and predictions [R package version 2.0.3]. https://doi.org/10.32614/CRAN.package.CalibrationCurves De Cock Campo, B. (2025). Introduction to the Calibratio...
- [2] The number of trees was set to one of the following values: 500, 1000, or 2000. For XGBoost, the maximum tree depth was varied over 3, 4, ..., 10 and the minimum child weight was between 1 and 10. The subsample ratio was between 0.5 and 1 and the column sampling ratio was between 0.5 and 1. Each range was discretized into eight candidate values. Classification T...