When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

Bertrand Laforge; Marie-Hélène Abel; Ngoc Luyen Le

arxiv: 2605.25794 · v1 · pith:L7FQOTOKnew · submitted 2026-05-25 · 💻 cs.AI

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

Pith reviewed 2026-06-29 21:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords early outcome prediction · temporal leakage · learning management systems · OULAD dataset · machine learning evaluation · early warning systems

The pith

Enforcing cutoff-first truncation before any joins or aggregations removes temporal leakage from early LMS outcome predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Early-warning models that predict course outcomes from LMS logs often report strong early performance that actually draws on data arriving after the chosen prediction time. The paper formalizes this temporal availability constraint and presents LEAP, a protocol that truncates every log record to the cutoff date first, before any joins, aggregations, or feature creation, then audits each feature to confirm its provenance stays within the cutoff. When LEAP is applied to the OULAD dataset across successive weekly cutoffs, performance still rises with more observation time and shows a noticeable lift near week three, yet the absolute numbers are lower once assessment-related leakage is blocked. Standard classifiers behave differently by cutoff: Random Forest leads at the earliest points while Gradient Boosting overtakes later. The central result is that trustworthy early predictions require this strict ordering of operations rather than post-hoc filtering.

Core claim

Cutoff-based early outcome prediction must respect a temporal availability constraint; LEAP enforces it by truncating interaction logs to the cutoff before joins or aggregation and by auditing feature provenance, which prevents post-cutoff evidence from entering the evaluation and shows that leakage, especially from assessments, inflates apparent early performance on OULAD.
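
The constraint can be written compactly with the symbols that already appear in the figure captions on this page (τ for a record's timestamp, ϕ for the feature map, R_i^(≤t) for the truncated record set). The remaining symbols, R_i for learner i's full record set, y_i for the end-of-course label, N for the number of learners, and D^(t) for the cutoff-specific dataset, are notation assumed here for illustration and may differ from the paper's:

```latex
R_i^{(\le t)} = \{\, r \in R_i : \tau(r) \le t \,\}, \qquad
x_i^{(t)} = \phi\!\left(R_i^{(\le t)}\right), \qquad
\mathcal{D}^{(t)} = \bigl\{\, (x_i^{(t)},\, y_i) \,\bigr\}_{i=1}^{N}
```

Read this way, the constraint says that the feature map only ever sees the truncated set, so no record with τ(r) > t can influence x_i^(t).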

What carries the argument

LEAP (Leakage-Excluded Early-Availability Protocol), which performs cutoff-first truncation of logs prior to any joins and aggregation and audits feature provenance to keep all evidence within the chosen time window.
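
A minimal sketch of the cutoff-first ordering, assuming a toy click log held in memory. The row layout, the function names, and the values are hypothetical and are not the OULAD schema or the authors' code; the only point is that filtering happens before any aggregation.

```python
from collections import defaultdict

# Hypothetical log rows: (student_id, day, clicks). Illustrative only.
LOG = [
    ("s1", 3, 5),
    ("s1", 10, 2),
    ("s1", 20, 7),   # later than a day-14 cutoff: must never reach the features
    ("s2", 14, 4),
    ("s2", 15, 9),   # later than the cutoff
]

def truncate(rows, cutoff_day):
    """Step 1: keep only records whose day is <= cutoff_day."""
    return [r for r in rows if r[1] <= cutoff_day]

def aggregate_clicks(rows):
    """Step 2: sum clicks per student over whatever rows it is given."""
    totals = defaultdict(int)
    for student_id, _day, clicks in rows:
        totals[student_id] += clicks
    return dict(totals)

def early_features(rows, cutoff_day):
    """Cutoff-first ordering: truncate, check, then aggregate."""
    kept = truncate(rows, cutoff_day)
    # Cheap leakage check: no retained record may be later than the cutoff.
    assert all(day <= cutoff_day for _sid, day, _clicks in kept)
    return aggregate_clicks(kept)

features_day14 = early_features(LOG, cutoff_day=14)
print(features_day14)
```

Aggregating the untruncated log instead gives different totals for both learners, and once rows have been summed there is no timestamp left to filter on, which is why the ordering matters.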

If this is right

Prediction quality improves steadily as the observation window lengthens, with a distinct gain near week three.
Random Forest yields the strongest results at the earliest cutoffs; Gradient Boosting becomes superior once more weeks are available.
Ablating assessment-related features that cross the cutoff lowers the reported early performance, confirming leakage as the source of inflation.
Multi-metric evaluation with ROC-AUC, PR-AUC, Brier score, and F1@0.5 gives a more stable picture than any single score alone.
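
The four metrics named above can be sketched in plain Python on a toy prediction vector. This is an illustration under assumptions: PR-AUC is computed here as average precision, which is one common definition and may not be the one the paper uses, and the labels and probabilities are invented.

```python
def roc_auc(y_true, y_prob):
    """Share of positive/negative pairs ranked correctly (ties count 0.5).
    Assumes both classes are present."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, y_prob):
    """Mean of precision values at the rank of each positive.
    Assumes at least one positive and no tied scores."""
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_prob, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def f1_at(y_true, y_prob, threshold=0.5):
    """F1 after thresholding probabilities at a fixed value."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented toy data: 1 marks the positive class.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.9, 0.6, 0.7, 0.2, 0.4]

report = {
    "roc_auc": roc_auc(y_true, y_prob),
    "pr_auc": average_precision(y_true, y_prob),
    "brier": brier_score(y_true, y_prob),
    "f1@0.5": f1_at(y_true, y_prob),
}
print(report)
```

The ranking metrics (ROC-AUC, PR-AUC) ignore calibration, the Brier score measures it, and F1@0.5 depends on one fixed threshold, so the four can disagree on the same predictions.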

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cutoff-first discipline could be applied to any timestamped log dataset used for early prediction, not only LMS data.
If a dataset lacks precise timestamps, LEAP-style evaluation becomes impossible and reported early results should carry an explicit uncertainty label.
Future model architectures might embed the cutoff constraint directly into the learning objective instead of relying on post-processing audits.

Load-bearing premise

The timestamps recorded in the OULAD interaction logs are accurate and fine-grained enough that cutoff-based truncation does not discard essential patterns or create hidden temporal dependencies.

What would settle it

Apply the same classifiers to OULAD once with standard processing and once with LEAP truncation plus provenance audit; if the early-week ROC-AUC, PR-AUC, and F1 scores do not drop when leakage is blocked, the claim that temporal violations were inflating results would be falsified.
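
The full comparison needs trained classifiers and the real dataset. The fragment below only illustrates the feature-level difference between a strict and an intentionally leaky variant, assuming a toy assessment table; the row layout and values are hypothetical.

```python
# Hypothetical assessment rows: (student_id, submission_day, score).
ASSESSMENTS = [
    ("s1", 12, 80),
    ("s1", 40, 30),   # submitted after a day-14 cutoff
    ("s2", 10, 55),
    ("s2", 50, 90),   # submitted after the cutoff
]

def mean_score(rows, cutoff_day=None):
    """Mean score per student.

    cutoff_day=None reproduces the leaky variant (all rows are used);
    an integer cutoff drops later rows before anything is aggregated.
    """
    sums, counts = {}, {}
    for student_id, day, score in rows:
        if cutoff_day is not None and day > cutoff_day:
            continue  # strict variant: post-cutoff evidence never enters
        sums[student_id] = sums.get(student_id, 0) + score
        counts[student_id] = counts.get(student_id, 0) + 1
    return {sid: sums[sid] / counts[sid] for sid in sums}

strict = mean_score(ASSESSMENTS, cutoff_day=14)
leaky = mean_score(ASSESSMENTS)

# Students whose feature value changes once post-cutoff rows are blocked.
changed = sorted(sid for sid in leaky if leaky[sid] != strict.get(sid))
print(strict, leaky, changed)
```

In an actual ablation, each feature table would feed the same classifier and the metric gap between the two runs at early cutoffs would be the quantity of interest.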

Figures

Figures reproduced from arXiv: 2605.25794 by Bertrand Laforge, Marie-Hélène Abel, Ngoc Luyen Le.

**Figure 1.** Example of an observation window ending at Day 14: only records with τ ≤ 14 are observed; later records (Days 15–56) are excluded to prevent temporal leakage.

**Figure 2.** LEAP pipeline at cutoff t: time truncation precedes feature construction, leakage checks enforce temporal validity, and models are trained and evaluated per cutoff. … checks to ensure that no retained record occurs after t. The filtered record set is subsequently transformed into an early representation x_i^(t) = ϕ(R_i^(≤t)), which is paired with its end-of-course label to form the cutoff-specific dataset …

**Figure 3.** Earliness–performance curves under strict LEAP (mean ± std over 5 seeds).

**Figure 4.** Leakage ablation: strict LEAP vs. intentionally leaky variants. RQ4 - Temporal Shift in Predictive Evidence: To characterize how predictive evidence evolves with earliness, we examine feature importance and linear coefficients across cutoffs. At early cutoffs, behavioral engagement dominates. At t=7, engagement volume and activity regularity are most influential; for example, total_clicks_t is the top fe…

Original abstract

Early-warning models built from Learning Management System (LMS) logs aim to predict end-of-course outcomes early enough to enable timely learner support. However, reported "early" performance is often inflated by temporal leakage. This occurs when the pipeline uses information that would not yet be available at the time of prediction. We formalize cutoff-based early outcome prediction under a temporal availability constraint and introduce LEAP (Leakage-Excluded Early-Availability Protocol), which enforces cutoff-first truncation prior to joins and aggregation and audits feature provenance to prevent post-cutoff evidence from entering the benchmark. We instantiate LEAP on the public Open University Learning Analytics Dataset (OULAD) as a multi-step protocol for leakage-controlled evaluation across weekly cutoffs. Using several standard learning methods, we evaluate performance using ROC-AUC, PR-AUC, Brier score, and F1@0.5. Results show improving performance as the observation window expands, with a marked gain around week 3; Random Forest performs best at the earliest cutoffs, while Gradient Boosting dominates thereafter. Leakage ablations further show that temporal violations, especially through assessment information, can inflate apparent "early" performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LEAP gives a concrete cutoff-first protocol for leakage control in LMS early warnings on OULAD, with useful ablations, but the timestamp reliability assumption needs checking.

The paper defines LEAP as a multi-step protocol that truncates LMS logs at the weekly prediction cutoff before any joins or aggregation, then audits feature provenance to block post-cutoff data. The authors apply it to OULAD and evaluate standard models with ROC-AUC, PR-AUC, Brier score, and F1. Performance improves as the window grows, with a clear jump around week 3, and the ablations quantify how much assessment leakage inflates early results.

What works is the explicit focus on temporal availability and the use of a public dataset with multiple metrics. The ablations on assessment information are straightforward and show the practical size of the problem. This addresses a real evaluation flaw that affects student-support decisions in learning analytics.

The soft spot is the reliance on OULAD timestamps being granular and accurate enough for exact weekly truncation. If tables contain daily aggregates or lagged submission records, rows kept under the cutoff could still embed information unavailable at prediction time. The paper should verify this directly rather than assume the schema supports it. The abstract lacks equations or full feature lists, so the full text needs to demonstrate the implementation matches the claimed guarantee.
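
The concern about lagged records can be made concrete with a small provenance audit. This is a sketch under assumptions: the `Provenance` record, its fields, and in particular the idea of tracking a separate "available" day are hypothetical and are not described in the abstract. Only the feature name total_clicks_t and the table names come from text quoted on this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    feature: str
    source_table: str
    latest_event_day: int      # latest timestamp among contributing rows
    latest_available_day: int  # day that row would actually be visible (assumed field)

def audit(provenances, cutoff_day):
    """Return names of features whose evidence is not available by cutoff_day."""
    return [
        p.feature
        for p in provenances
        if max(p.latest_event_day, p.latest_available_day) > cutoff_day
    ]

# Invented example: the second feature passes a timestamp-only filter
# (event on day 13) but only becomes visible on day 16.
records = [
    Provenance("total_clicks_t", "studentVle", 14, 14),
    Provenance("mean_score_t", "assessments", 13, 16),
]

violations = audit(records, cutoff_day=14)
print(violations)
```

A filter on the event timestamp alone would keep both features at a day-14 cutoff; the audit flags the second one because its availability day falls after the cutoff.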

This is for researchers building or evaluating early-warning systems in educational data mining. A reader working on time-aware benchmarks or LMS analytics would find the protocol and ablation results worth examining.

It deserves peer review because it targets a methodological issue with a reproducible public-data setup, even if the timestamp concern requires clarification.

Referee Report

1 major / 2 minor

Summary. The paper formalizes cutoff-based early outcome prediction from LMS logs under a temporal availability constraint to avoid leakage, introduces the LEAP protocol that performs cutoff-first truncation before any joins or aggregation plus feature provenance auditing, and instantiates it as a multi-step evaluation on the OULAD dataset. Using standard classifiers it reports ROC-AUC, PR-AUC, Brier, and F1 trends across weekly cutoffs, notes a performance jump around week 3, identifies Random Forest as strongest at earliest cutoffs and Gradient Boosting later, and shows via ablations that assessment-related leakage inflates early performance.

Significance. If the LEAP protocol is shown to be correctly implemented and the OULAD timestamps support the claimed truncation, the work supplies a reusable, auditable benchmark that directly tackles a pervasive source of over-optimism in learning-analytics early-warning literature. The explicit separation of the protocol definition from any fitted model parameters and the use of a public external dataset are strengths that would make the contribution reproducible and extensible.

major comments (1)

[§4] §4 (LEAP instantiation on OULAD): the central guarantee that cutoff-first truncation prevents post-cutoff evidence rests on the assumption that every event row in studentVle, assessments, and related tables carries a timestamp whose precision and correctness allow exact filtering at each weekly cutoff. The manuscript provides no sensitivity analysis or documentation of timestamp granularity, daily aggregation effects, or known submission-time lags in OULAD; without this the reported leakage ablations and performance curves cannot be verified to be leakage-free.

minor comments (2)

[Methods] The abstract and methods would benefit from an explicit enumerated list of the exact features retained after each weekly truncation and the precise join order used in the LEAP pipeline.
[Results] Figure captions should state the exact number of students and positive-class prevalence at each cutoff to allow readers to interpret the PR-AUC and F1 values.
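
The per-cutoff counts requested in the second minor comment could be produced along these lines. The sketch assumes, purely for illustration, that a cutoff-specific cohort keeps only learners who have not withdrawn by the cutoff; the review does not say how the paper defines its cohorts, and the rows below are invented.

```python
# Hypothetical cohort rows: (student_id, withdrawal_day or None, label),
# where label 1 marks the positive class.
COHORT = [
    ("s1", None, 0),
    ("s2", 5, 1),
    ("s3", None, 1),
    ("s4", 20, 1),
    ("s5", None, 0),
]

def cohort_summary(rows, cutoff_day):
    """Count learners still enrolled after cutoff_day and the share with label 1."""
    active = [r for r in rows if r[1] is None or r[1] > cutoff_day]
    n = len(active)
    positives = sum(label for _sid, _withdrawal, label in active)
    prevalence = positives / n if n else float("nan")
    return {"cutoff": cutoff_day, "n": n, "prevalence": prevalence}

summaries = [cohort_summary(COHORT, t) for t in (7, 14, 21)]
for row in summaries:
    print(row)
```

Prevalence matters for reading PR-AUC because a ranker with no skill scores roughly the positive-class prevalence on that metric, so the same PR-AUC value means different things at different cutoffs if the cohort changes.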

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the LEAP protocol as a reusable benchmark. We respond to the major comment below.

Referee: [§4] §4 (LEAP instantiation on OULAD): the central guarantee that cutoff-first truncation prevents post-cutoff evidence rests on the assumption that every event row in studentVle, assessments, and related tables carries a timestamp whose precision and correctness allow exact filtering at each weekly cutoff. The manuscript provides no sensitivity analysis or documentation of timestamp granularity, daily aggregation effects, or known submission-time lags in OULAD; without this the reported leakage ablations and performance curves cannot be verified to be leakage-free.

Authors: We agree that the manuscript would benefit from explicit documentation of timestamp handling to support verifiability. OULAD records VLE interactions at daily granularity and assessment submissions with exact dates; the revised manuscript will add a dedicated paragraph in §4 describing these formats, confirming that all filtering uses the provided timestamps, and noting that the dataset documentation does not specify additional submission-time lags. We will also include a short sensitivity analysis comparing nominal weekly cutoffs against one-day shifts to assess robustness to daily aggregation effects. These additions will be made without changing the reported performance trends or leakage ablations. revision: yes
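
The one-day-shift check promised in the rebuttal might start from a comparison like the one below. It is a sketch with invented rows and hypothetical function names; a real analysis would retrain and rescore the models at each shifted cutoff rather than only comparing feature values.

```python
# Hypothetical log rows: (student_id, day, clicks).
LOG = [
    ("s1", 6, 3),
    ("s1", 7, 4),
    ("s1", 8, 10),
    ("s2", 7, 1),
    ("s2", 9, 2),
]

def total_clicks(rows, cutoff_day):
    """Per-student click totals using only rows with day <= cutoff_day."""
    totals = {}
    for student_id, day, clicks in rows:
        if day <= cutoff_day:
            totals[student_id] = totals.get(student_id, 0) + clicks
    return totals

def shift_sensitivity(rows, cutoff_day, shifts=(-1, 0, 1)):
    """Feature values at the nominal cutoff and at one-day shifts around it."""
    return {shift: total_clicks(rows, cutoff_day + shift) for shift in shifts}

table = shift_sensitivity(LOG, cutoff_day=7)
for shift, features in table.items():
    print(shift, features)
```

Large swings between adjacent shifts would suggest that daily aggregation places a lot of evidence right at the boundary, which is the situation the referee asks the authors to rule out.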

Circularity Check

0 steps flagged

No circularity: LEAP is an independently specified protocol applied to external data

The paper defines a cutoff-first truncation protocol (LEAP) as a methodological safeguard against temporal leakage and applies it to the public OULAD dataset using standard classifiers and metrics (ROC-AUC, etc.). No equations, fitted parameters, or self-citations reduce the reported results back to the protocol definition itself. The central contribution is the protocol specification, which stands independently of any outcome metrics. This matches the default case of a self-contained methodological paper with no load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields minimal explicit assumptions; the protocol implicitly rests on dataset properties rather than new mathematical axioms or fitted constants.

axioms (1)

domain assumption: LMS interaction logs contain reliable timestamps permitting exact cutoff-based truncation
Required for the cutoff-first truncation step to be feasible without additional data cleaning or loss.

invented entities (1)

LEAP protocol · no independent evidence
purpose: Enforce temporal availability constraint and feature provenance audit in early-prediction pipelines
Newly introduced named method whose correctness is demonstrated only within the paper's own experiments.

pith-pipeline@v0.9.1-grok · 5748 in / 1353 out tokens · 45919 ms · 2026-06-29T21:17:12.616361+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning
cs.LG 2026-06 unverdicted novelty 5.0

Zero-shot LLMs exhibit intervention bias in educational advising, over-recommending actions by 43 percentage points, while supervised DT and XGBoost models achieve near-zero calibration error and macro-F1 of 0.79.

Reference graph

Works this paper leans on

22 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1] Akçapınar, G., Altun, A., Aşkar, P.: Using learning analytics to develop early-warning system for at-risk students. International Journal of Educational Technology in Higher Education 16(1), 1–20 (2019)
[2] Bernacki, M.L., Chavez, M.M., Uesbeck, P.M.: Predicting achievement and providing support before STEM majors begin to fail. Computers & Education 158 (2020)
[3] Chaka, C.: Educational data mining, student academic performance prediction, prediction methods, algorithms and tools: An overview of reviews (2021)
[4] Conijn, R., Snijders, C., Kleingeld, A., Matzat, U.: Predicting student performance from LMS data: A comparison of 17 blended courses using Moodle LMS. IEEE Transactions on Learning Technologies 10(1), 17–29 (2016)
[5] Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 233–240 (2006)
[6] Davis, S.E., Matheny, M.E., Balu, S., Sendak, M.P.: A framework for understanding label leakage in machine learning for health care. Journal of the American Medical Informatics Association 31(1), 274–280 (2023)
[7] Esbenshade, L., Vitale, J., Baker, R.S.: Non-overlapping leave future out validation (NOLFO): Implications for graduation prediction. In: Proceedings of the 17th International Conference on Educational Data Mining. pp. 602–609 (2024)
[8] Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
[9] Glenn, W.B., et al.: Verification of forecasts expressed in terms of probability. Monthly Weather Review 78(1), 1–3 (1950)
[10] Gneiting, T., Raftery, A.E.: Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378 (2007)
[11] Hu, Y.H., Lo, C.L., Shih, S.P.: Developing early warning systems to predict students’ online learning performance. Computers in Human Behavior 36 (2014)
[12] Kapoor, S., Narayanan, A.: Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9) (2023)
[13] Kaufman, S., Rosset, S., Perlich, C., Stitelman, O.: Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data (TKDD) 6(4), 1–21 (2012)
[14] Kuzilek, J., Hlosta, M., Zdrahal, Z.: Open University Learning Analytics dataset. Scientific Data 4(1), 1–8 (2017)
[15] Le, N.L., Abel, M.H.: Automated skill decomposition meets expert ontologies: Bridging the granularity gap with LLMs. arXiv preprint arXiv:2510.11313 (2025)
[16] Le, N.L., Abel, M.H.: How well do LLMs predict prerequisite skills? Zero-shot comparison to expert-defined concepts. arXiv preprint arXiv:2507.18479 (2025)
[17] Macfadyen, L.P., Dawson, S.: Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & Education 54(2), 588–599 (2010)
[18] Pedregosa, F., Varoquaux, G., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
[19] Saito, T., Rehmsmeier, M.: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3), e0118432 (2015)
[20] Santos, R.M., Henriques, R.: Accurate, timely, and portable: Course-agnostic early prediction of student performance from LMS logs. Computers and Education: Artificial Intelligence 5, 100175 (2023)
[21] Tiggeloven, T., Pfeiffer, S., et al.: The role of artificial intelligence for early warning systems: Status, applicability, guardrails, and ways forward. iScience 28(11) (2025)
[22] Van Rijsbergen, C.: Information retrieval: theory and practice. In: Proceedings of the Joint IBM/University of Newcastle upon Tyne Seminar on Data Base Systems. vol. 79, pp. 1–14 (1979)
