Analyzing Process Data from Computer-Based Assessments: A Tutorial on Preprocessing, Feature Extraction, and Model-Based Inference
Pith reviewed 2026-05-10 07:12 UTC · model grok-4.3
The pith
Raw interaction logs from computer assessments can be cleaned and analyzed to extract behavioral patterns that aid error diagnosis and reduce unfair group differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A systematic preprocessing pipeline followed by complementary feature-based and model-based analyses extracts diagnostic information from process data; n-gram clusters carry extra value mainly for incorrect respondents, multidimensional scaling features reconstruct observed behaviors, and process-informed differential item functioning analysis identifies and mitigates non-construct sources of group differences.
What carries the argument
The end-to-end framework that converts raw event logs into analysis-ready action sequences through timestamp correction, duplicate removal, block consolidation, and standardization, then applies n-gram weighting, multidimensional scaling, hidden Markov models, and process-augmented differential item functioning checks.
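As a rough illustration of the four cleaning steps named above, the Python sketch below runs them on a toy event log. Everything in it is an assumption made for illustration: the `Event` fields, the `STANDARD_LABELS` lookup (a plain table standing in for the paper's LLM-assisted standardization, which is not reproduced here), and the reading of "timestamp correction" as restoring chronological order. The paper's own implementation is in R and may differ in every detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One raw log row (hypothetical schema)."""
    respondent: str
    timestamp: float  # assumed to be seconds since task start
    action: str


# Hypothetical lookup from raw labels to standardized labels.
# A stand-in for LLM-assisted standardization, not the paper's method.
STANDARD_LABELS = {
    "btn_click_next": "NEXT",
    "BTN_NEXT": "NEXT",
    "key_press": "TYPE",
    "keypress": "TYPE",
}


def preprocess(events):
    """Return {respondent: [standardized action labels]}."""
    by_respondent = {}
    for event in events:
        by_respondent.setdefault(event.respondent, []).append(event)

    sequences = {}
    for respondent, evs in by_respondent.items():
        # 1. Timestamp correction, read here as restoring chronological order.
        ordered = sorted(evs, key=lambda e: e.timestamp)

        # 2. Duplicate removal: keep the first event per (timestamp, action).
        seen = set()
        unique = []
        for e in ordered:
            key = (e.timestamp, e.action)
            if key not in seen:
                seen.add(key)
                unique.append(e)

        # 3. Block consolidation: collapse runs of the same raw action.
        blocks = []
        for e in unique:
            if not blocks or blocks[-1] != e.action:
                blocks.append(e.action)

        # 4. Standardization: map raw labels to a shared vocabulary.
        sequences[respondent] = [
            STANDARD_LABELS.get(a, a.upper()) for a in blocks
        ]
    return sequences
```

Calling `preprocess` on a list of `Event` objects yields one action sequence per respondent. Because consolidation runs before standardization in this sketch, two differently spelled raw labels that map to the same standard label can still appear back to back; a second consolidation pass would be one way to handle that.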
If this is right
- N-gram-derived behavioral clusters supply additional diagnostic detail primarily when respondents give incorrect answers.
- Features obtained from multidimensional scaling recover the full set of observed behavioral variables in the data.
- Incorporating process data into differential item functioning analysis detects and reduces group differences unrelated to the measured construct.
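To make the last point concrete, the sketch below shows one way a process-derived variable could enter a DIF check. It is an assumption-laden illustration, not the paper's procedure: the summary does not say which DIF method the authors use, so the Mantel-Haenszel common odds ratio is swapped in here, and the idea of adding a behavioral cluster to the stratification is hypothetical. The function name and the record layout are likewise invented for the example.

```python
def mantel_haenszel_or(records):
    """Mantel-Haenszel common odds ratio across strata.

    records: iterable of (stratum, group, correct) with group in
    {"ref", "focal"} and correct a bool. A value near 1 suggests no DIF
    after conditioning on the stratifying variable.
    """
    tables = {}
    for stratum, group, correct in records:
        table = tables.setdefault(stratum, {"ref": [0, 0], "focal": [0, 0]})
        table[group][0 if correct else 1] += 1

    numerator = 0.0
    denominator = 0.0
    for table in tables.values():
        a, b = table["ref"]    # reference group: correct, incorrect
        c, d = table["focal"]  # focal group: correct, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator if denominator else float("nan")
```

In a conventional analysis the stratum would be the matching score alone. Under the assumption made here, a process-informed variant would use a pair such as (matching score, behavioral cluster): if the odds ratio moves toward 1 once the cluster is added, the group difference is at least partly carried by how respondents worked through the item rather than by the construct.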
Where Pith is reading between the lines
- The same cleaning and extraction steps could be embedded in live adaptive testing platforms to adjust item selection on the fly.
- Routine use of process-informed bias checks might shift assessment design toward sequences that minimize unintended group effects.
- Extending the workflow to smaller classroom systems could let teachers see common error paths without manual log review.
Load-bearing premise
The described cleaning steps and chosen analysis methods yield unbiased results that hold across different digital assessments and respondent groups.
What would settle it
Running the identical pipeline and methods on logs from another computer-based test where independent external checks exist for response patterns or group fairness, then finding that the derived clusters or adjusted bias estimates fail to match those checks.
Original abstract
Computer-based assessments routinely generate detailed interaction logs -- commonly referred to as process data -- that record every action a respondent performs during task completion, yet systematic preprocessing guidance, integrated analytical workflows, and cross-method consistency checks remain scarce in the literature. This paper provides a unified, end-to-end analytical framework for analyzing process data from large-scale assessments -- covering the full pipeline from raw log preprocessing to model-based inference -- using the Programme for the International Assessment of Adult Competencies (PIAAC) Problem Solving in Technology-Rich Environments (PS-TRE) domain as an illustrative example. We first present a systematic preprocessing pipeline -- including timestamp correction, duplicate removal, action block consolidation, and LLM-assisted standardization -- that transforms raw event-level logs into analysis-ready action sequences. We then review and demonstrate two complementary families of analytical methods. The first consists of feature-based methods and their downstream applications, including descriptive process indicators, n-gram analysis with TF--IDF weighting, multidimensional scaling, and process data-informed differential item functioning (DIF) analysis. The second consists of model-based approaches, namely hidden Markov models and the subtask identification procedure. Empirical illustrations using the United States sample illustrate that n-gram-based behavioral clusters carry differential diagnostic information primarily among incorrect respondents, that multidimentionsl scaling-derived features comprehensively reconstruct observed behavioral variables, and that process-informed DIF analyses can identify and mitigate construct-irrelevant sources of group differences. Reproducible R code implementations are provided for all major techniques.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a tutorial presenting an end-to-end framework for preprocessing and analyzing process data from computer-based assessments, using the PIAAC PS-TRE domain (US sample) as the running example. It details a preprocessing pipeline (timestamp correction, duplicate removal, action block consolidation, LLM-assisted standardization) that converts raw logs into action sequences, then demonstrates complementary methods: feature-based approaches (descriptive indicators, n-gram analysis with TF-IDF, multidimensional scaling, and process-informed DIF) and model-based approaches (hidden Markov models and subtask identification). The empirical illustrations claim that n-gram behavioral clusters carry differential diagnostic information primarily among incorrect respondents, that MDS-derived features comprehensively reconstruct observed behavioral variables, and that process-informed DIF can identify and mitigate construct-irrelevant group differences. Reproducible R code is supplied for the major techniques.
Significance. If the described pipeline and methods perform as illustrated, the paper supplies a much-needed unified, practical reference for researchers working with interaction logs in large-scale assessments. The combination of standard statistical procedures with reproducible code and concrete illustrations on a public dataset (PIAAC) lowers the barrier to entry and supports more consistent, transparent use of process data for diagnostic and fairness purposes in educational measurement.
minor comments (3)
- [Abstract] Abstract: 'multidimentionsl scaling' is a typographical error and should read 'multidimensional scaling'.
- The manuscript would benefit from an explicit statement of the US PIAAC sample size, number of items/tasks analyzed, and any exclusion criteria applied before the illustrations, to allow readers to assess generalizability.
- Section on n-gram analysis: clarify whether the TF-IDF weighting is applied at the respondent level or aggregated across the sample, as this affects interpretation of the cluster differential information.
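The distinction raised in the last comment can be made concrete with a small Python sketch of respondent-level weighting, in which each respondent's action sequence is treated as one "document". This is an illustration under assumptions, not the manuscript's code: the function names are invented, and the plain `tf * log(N / df)` form is only one of several TF-IDF variants the authors might have used.

```python
import math
from collections import Counter


def ngrams(sequence, n):
    """All contiguous n-grams of an action sequence, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def tfidf_by_respondent(sequences, n=2):
    """Return {respondent: {ngram: weight}} with respondent-level TF-IDF.

    tf  = share of the respondent's n-grams that are this n-gram
    idf = log(N / df), where df counts respondents showing the n-gram
    """
    counts = {r: Counter(ngrams(seq, n)) for r, seq in sequences.items()}
    n_respondents = len(counts)

    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())

    weights = {}
    for respondent, counter in counts.items():
        total = sum(counter.values())
        weights[respondent] = {
            gram: (k / total) * math.log(n_respondents / doc_freq[gram])
            for gram, k in counter.items()
        }
    return weights
```

With this weighting, an n-gram that every respondent produces gets weight zero, while one that is rare across respondents but frequent within a sequence is weighted up. An aggregated variant would instead compute a single weight per n-gram over the pooled sample, which is why the choice affects how cluster differences should be read.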
Simulated Author's Rebuttal
We thank the referee for the positive review, the recognition of the tutorial's practical value, and the recommendation to accept. We are pleased that the end-to-end framework, reproducible code, and illustrations on the public PIAAC dataset are viewed as lowering barriers for researchers analyzing process data.
Circularity Check
No significant circularity
full rationale
The paper is a tutorial describing a preprocessing pipeline and standard analytical methods (n-gram analysis, MDS, HMMs, DIF) applied to an external public dataset (PIAAC US sample). All claims are empirical illustrations of what the methods produce on that data; no equations define target quantities in terms of themselves, no predictions reduce to fitted inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Reproducible code is supplied for the conventional procedures used. This is a self-contained methods tutorial with no derivation chain that collapses to its inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Standard assumptions underlying hidden Markov models for sequential behavioral data hold for assessment logs.
- domain assumption: n-gram representations and TF-IDF weighting capture diagnostically relevant patterns in action sequences.
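For readers who want to see what the first assumption commits to, the sketch below evaluates the likelihood of one action sequence under a discrete hidden Markov model using the scaled forward algorithm. It is a generic textbook computation with hypothetical names and parameter layout, not the paper's R code: the model assumes that the current hidden state depends only on the previous one and that each observed action depends only on the current state.

```python
import math


def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of an observed action sequence under a discrete HMM.

    obs:   list of observed action labels
    start: start[i] = P(first state is i)
    trans: trans[i][j] = P(next state is j | current state is i)
    emit:  emit[i] = dict mapping action label -> P(label | state i)
    """
    if not obs:
        return 0.0
    n_states = len(start)
    alpha = [start[i] * emit[i].get(obs[0], 0.0) for i in range(n_states)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                * emit[j].get(obs[t], 0.0)
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        # Rescale to avoid underflow; the log scales sum to the log-likelihood.
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik
```

Comparing such log-likelihoods across models with different numbers of hidden states, with a penalty such as AIC or BIC, is a common way to choose the state count, although the summary above does not say which criterion the authors apply.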