Analyzing Process Data from Computer-Based Assessments: A Tutorial on Preprocessing, Feature Extraction, and Model-Based Inference

Daeun Hwangbo; Ick Hoon Jin; Junyeong Park; Minjeong Jeon

arxiv: 2604.16900 · v1 · submitted 2026-04-18 · 📊 stat.AP

Pith reviewed 2026-05-10 07:12 UTC · model grok-4.3

keywords: process data, computer-based assessments, preprocessing pipeline, feature extraction, n-gram analysis, hidden Markov models, differential item functioning, behavioral sequences

The pith

Raw interaction logs from computer assessments can be cleaned and analyzed to extract behavioral patterns that aid error diagnosis and reduce unfair group differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a complete workflow for turning detailed action logs from digital tests into usable insights. It begins with a cleaning stage that corrects timestamps, removes duplicates, consolidates blocks of related actions, and standardizes action labels. It then covers feature-based methods, such as TF-IDF-weighted n-gram counts and multidimensional scaling, that summarize sequences, and model-based methods, such as hidden Markov models, that identify subtasks. Illustrations on one large adult skills dataset (the United States PIAAC sample) show that the resulting behavioral clusters carry diagnostic information mainly among respondents who answered incorrectly, that multidimensional scaling features reconstruct the observed behavioral variables, and that adding process information to differential item functioning checks can flag and lessen construct-irrelevant group differences.

Core claim

A systematic preprocessing pipeline followed by complementary feature-based and model-based analyses extracts diagnostic information from process data; n-gram clusters carry extra value mainly for incorrect respondents, multidimensional scaling features reconstruct observed behaviors, and process-informed differential item functioning analysis identifies and mitigates non-construct sources of group differences.

What carries the argument

The end-to-end framework that converts raw event logs into analysis-ready action sequences through timestamp correction, duplicate removal, block consolidation, and standardization, then applies n-gram weighting, multidimensional scaling, hidden Markov models, and process-augmented differential item functioning checks.
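The four cleaning steps named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's R implementation: the `Event` fields, the `LABEL_MAP` table, and the reading of "timestamp correction" as restoring chronological order are all hypothetical choices made for this example, and the paper's LLM-assisted standardization is replaced here by a fixed lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    respondent: str
    timestamp: float  # assumed unit: seconds since item start
    action: str       # raw action label from the log


# Hypothetical lookup standing in for the paper's LLM-assisted standardization.
LABEL_MAP = {"btn_next": "NEXT", "keypress": "KEYPRESS", "kbd": "KEYPRESS"}


def preprocess(events):
    """Turn raw events into {respondent: [standardized action, ...]}."""
    by_respondent = {}
    for ev in events:
        by_respondent.setdefault(ev.respondent, []).append(ev)

    sequences = {}
    for resp, evs in by_respondent.items():
        # 1. Timestamp correction (assumed here: restore chronological order).
        evs = sorted(evs, key=lambda e: e.timestamp)

        # 2. Duplicate removal: drop exact repeats of (timestamp, action).
        seen, unique = set(), []
        for e in evs:
            key = (e.timestamp, e.action)
            if key not in seen:
                seen.add(key)
                unique.append(e)

        # 3. Block consolidation: collapse runs of the same action into one.
        blocks = []
        for e in unique:
            if not blocks or blocks[-1] != e.action:
                blocks.append(e.action)

        # 4. Standardization: map raw labels to a common vocabulary.
        sequences[resp] = [LABEL_MAP.get(a, a.upper()) for a in blocks]
    return sequences


demo_events = [
    Event("r1", 2.0, "keypress"),
    Event("r1", 0.0, "START"),
    Event("r1", 1.0, "keypress"),
    Event("r1", 1.0, "keypress"),  # exact duplicate
    Event("r1", 3.0, "btn_next"),
]
demo_sequences = preprocess(demo_events)
# -> {'r1': ['START', 'KEYPRESS', 'NEXT']}
```

The step order follows the order in which the source lists the steps. Standardizing before consolidating would also merge runs of differently spelled labels for the same action, so the order is a design decision worth checking against the actual log vocabulary.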

If this is right

N-gram-derived behavioral clusters supply additional diagnostic detail primarily when respondents give incorrect answers.
Features obtained from multidimensional scaling recover the full set of observed behavioral variables in the data.
Incorporating process data into differential item functioning analysis detects and reduces group differences unrelated to the measured construct.
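To make the last point concrete, the sketch below uses the Mantel-Haenszel common odds ratio, a standard DIF statistic, and compares matching on score alone with matching on score plus a process-derived feature. This is an assumption-laden illustration: the source does not say which DIF procedure the paper uses, the `strategy` variable is a hypothetical behavioral cluster label, and the counts are synthetic numbers made up so that the group difference disappears once the process feature enters the matching.

```python
import math
from collections import defaultdict


def mh_odds_ratio(records, stratum_of):
    """Mantel-Haenszel common odds ratio across strata.

    Each record is a dict with 'group' ('ref' or 'focal') and 'correct' (0/1);
    stratum_of(record) returns the matching stratum for that record.
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for rec in records:
        cell = tables[stratum_of(rec)]
        offset = 0 if rec["group"] == "ref" else 2
        cell[offset + (0 if rec["correct"] else 1)] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("MH odds ratio is undefined for these tables")
    return num / den


def mh_d_dif(alpha):
    """ETS delta-scale transformation of the MH odds ratio."""
    return -2.35 * math.log(alpha)


def make(group, strategy, n_correct, n_incorrect):
    """Build synthetic records (hypothetical data, one score level only)."""
    base = {"group": group, "strategy": strategy, "score": 1}
    return ([dict(base, correct=1) for _ in range(n_correct)]
            + [dict(base, correct=0) for _ in range(n_incorrect)])


records = (make("ref", "efficient", 8, 2) + make("focal", "efficient", 4, 1)
           + make("ref", "trial", 1, 4) + make("focal", "trial", 2, 8))

# Matching on score only: the groups appear to differ.
alpha_score = mh_odds_ratio(records, lambda r: r["score"])
# Matching on score and the process feature: the difference vanishes.
alpha_process = mh_odds_ratio(records, lambda r: (r["score"], r["strategy"]))
```

In this toy data the two groups differ only in how often they use each strategy, so the apparent DIF under score-only matching is attributable to the process feature. Whether such a feature is construct-irrelevant is a substantive judgment that the statistic alone cannot make.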

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cleaning and extraction steps could be embedded in live adaptive testing platforms to adjust item selection on the fly.
Routine use of process-informed bias checks might shift assessment design toward sequences that minimize unintended group effects.
Extending the workflow to smaller classroom systems could let teachers see common error paths without manual log review.

Load-bearing premise

The described cleaning steps and chosen analysis methods yield unbiased results that hold across different digital assessments and respondent groups.

What would settle it

Running the identical pipeline and methods on logs from another computer-based test where independent external checks exist for response patterns or group fairness, then finding that the derived clusters or adjusted bias estimates fail to match those checks.

Figures

Figures reproduced from arXiv: 2604.16900 by Daeun Hwangbo, Ick Hoon Jin, Junyeong Park, Minjeong Jeon.

**Figure 2.** Excerpt of raw event-level process data for the item

**Figure 3.** Excerpt of preprocessed event-level data for the item

Original abstract

Computer-based assessments routinely generate detailed interaction logs -- commonly referred to as process data -- that record every action a respondent performs during task completion, yet systematic preprocessing guidance, integrated analytical workflows, and cross-method consistency checks remain scarce in the literature. This paper provides a unified, end-to-end analytical framework for analyzing process data from large-scale assessments -- covering the full pipeline from raw log preprocessing to model-based inference -- using the Programme for the International Assessment of Adult Competencies (PIAAC) Problem Solving in Technology-Rich Environments (PS-TRE) domain as an illustrative example. We first present a systematic preprocessing pipeline -- including timestamp correction, duplicate removal, action block consolidation, and LLM-assisted standardization -- that transforms raw event-level logs into analysis-ready action sequences. We then review and demonstrate two complementary families of analytical methods. The first consists of feature-based methods and their downstream applications, including descriptive process indicators, n-gram analysis with TF--IDF weighting, multidimensional scaling, and process data-informed differential item functioning (DIF) analysis. The second consists of model-based approaches, namely hidden Markov models and the subtask identification procedure. Empirical illustrations using the United States sample illustrate that n-gram-based behavioral clusters carry differential diagnostic information primarily among incorrect respondents, that multidimentionsl scaling-derived features comprehensively reconstruct observed behavioral variables, and that process-informed DIF analyses can identify and mitigate construct-irrelevant sources of group differences. Reproducible R code implementations are provided for all major techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a practical tutorial that assembles existing preprocessing and analysis steps for process data into one workflow with R code, but adds no new methods.

The main takeaway is that this paper walks through a complete pipeline for turning raw event logs from computer-based assessments into usable sequences, then applies feature-based and model-based techniques, all demonstrated on the US PIAAC PS-TRE sample. It covers timestamp fixes, duplicate removal, action consolidation, and LLM-assisted label standardization before moving to n-grams with TF-IDF, multidimensional scaling, process-informed DIF, hidden Markov models, and subtask identification. Reproducible code is included for the main pieces.

Referee Report

0 major / 3 minor

Summary. The manuscript is a tutorial presenting an end-to-end framework for preprocessing and analyzing process data from computer-based assessments, using the PIAAC PS-TRE domain (US sample) as the running example. It details a preprocessing pipeline (timestamp correction, duplicate removal, action block consolidation, LLM-assisted standardization) that converts raw logs into action sequences, then demonstrates complementary methods: feature-based approaches (descriptive indicators, n-gram analysis with TF-IDF, multidimensional scaling, and process-informed DIF) and model-based approaches (hidden Markov models and subtask identification). The empirical illustrations claim that n-gram behavioral clusters carry differential diagnostic information primarily among incorrect respondents, that MDS-derived features comprehensively reconstruct observed behavioral variables, and that process-informed DIF can identify and mitigate construct-irrelevant group differences. Reproducible R code is supplied for the major techniques.

Significance. If the described pipeline and methods perform as illustrated, the paper supplies a much-needed unified, practical reference for researchers working with interaction logs in large-scale assessments. The combination of standard statistical procedures with reproducible code and concrete illustrations on a public dataset (PIAAC) lowers the barrier to entry and supports more consistent, transparent use of process data for diagnostic and fairness purposes in educational measurement.

minor comments (3)

[Abstract] 'multidimentionsl scaling' is a typographical error and should read 'multidimensional scaling'.
The manuscript would benefit from an explicit statement of the US PIAAC sample size, number of items/tasks analyzed, and any exclusion criteria applied before the illustrations, to allow readers to assess generalizability.
Section on n-gram analysis: clarify whether the TF-IDF weighting is applied at the respondent level or aggregated across the sample, as this affects interpretation of the cluster differential information.
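The distinction raised in the last comment can be illustrated with a small respondent-level sketch, in which each respondent's action sequence is treated as one document. This is only one common TF-IDF variant (raw term frequency times the unsmoothed log inverse document frequency); the source does not state which variant or level of aggregation the paper uses, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def ngrams(sequence, n):
    """All contiguous n-grams of an action sequence, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def tfidf(sequences, n=2):
    """Respondent-level TF-IDF weights for action n-grams.

    sequences: {respondent: [action, ...]}
    Returns {respondent: {ngram: weight}} with weight = tf * log(N / df),
    where tf is the n-gram count within the respondent's sequence, N is the
    number of respondents, and df is the number of respondents whose
    sequence contains the n-gram.
    """
    counts = {resp: Counter(ngrams(seq, n)) for resp, seq in sequences.items()}
    n_respondents = len(counts)

    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())  # each respondent counts once per n-gram

    return {
        resp: {g: tf * math.log(n_respondents / doc_freq[g]) for g, tf in c.items()}
        for resp, c in counts.items()
    }


toy_sequences = {
    "r1": ["START", "A", "B", "NEXT"],
    "r2": ["START", "A", "A", "NEXT"],
}
weights = tfidf(toy_sequences, n=2)
```

Under this weighting, an n-gram that every respondent produces, such as `("START", "A")` in the toy data, gets weight zero, while n-grams specific to one respondent are weighted up. Aggregating counts across the sample before weighting would instead describe the sample as a whole and could not feed respondent-level clustering.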

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of the tutorial's practical value, and the recommendation to accept. We are pleased that the end-to-end framework, reproducible code, and illustrations on the public PIAAC dataset are viewed as lowering barriers for researchers analyzing process data.

Circularity Check

0 steps flagged

No significant circularity

The paper is a tutorial describing a preprocessing pipeline and standard analytical methods (n-gram analysis, MDS, HMMs, DIF) applied to an external public dataset (PIAAC US sample). All claims are empirical illustrations of what the methods produce on that data; no equations define target quantities in terms of themselves, no predictions reduce to fitted inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Reproducible code is supplied for the conventional procedures used. This is a self-contained methods tutorial with no derivation chain that collapses to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard statistical assumptions for the applied methods rather than introducing new free parameters, axioms, or entities; preprocessing choices and model specifications (e.g., number of HMM states) are treated as user decisions within established frameworks.

axioms (2)

Domain assumption: Standard assumptions underlying hidden Markov models for sequential behavioral data hold for assessment logs. (Invoked for the model-based inference component.)
Domain assumption: n-gram representations and TF-IDF weighting capture diagnostically relevant patterns in action sequences. (Basis for the feature-based methods.)
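Because the number of hidden states is left as a user decision, one common way to compare candidate models is an information criterion computed from the model log-likelihood. The sketch below shows the scaled forward algorithm for a discrete-emission hidden Markov model together with BIC. It is an illustration under assumptions, not the paper's code: actions are assumed to be integer-coded, the parameters are assumed to have been fitted elsewhere (for example by Baum-Welch), and all function names are hypothetical.

```python
import math


def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of one non-empty observation sequence under a discrete HMM.

    obs:   list of integer-coded actions
    start: start[i]    = P(state_0 = i)
    trans: trans[i][j] = P(state_t = j | state_{t-1} = i)
    emit:  emit[i][k]  = P(action k | state i)
    Uses per-step scaling to avoid numerical underflow on long sequences.
    """
    n_states = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)  # P(obs[t] | obs[0..t-1])
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


def n_free_params(n_states, n_symbols):
    """Free parameters: start probs, transition rows, emission rows."""
    return (n_states - 1) + n_states * (n_states - 1) + n_states * (n_symbols - 1)


def bic(loglik, n_states, n_symbols, n_obs):
    """Bayesian information criterion; lower values are preferred."""
    return -2.0 * loglik + n_free_params(n_states, n_symbols) * math.log(n_obs)


# Toy check: two identical states behave like a single state.
start = [0.5, 0.5]
trans = [[0.5, 0.5], [0.5, 0.5]]
emit = [[0.25, 0.75], [0.25, 0.75]]
toy_loglik = forward_loglik([0, 1], start, trans, emit)
```

For several sequences, the log-likelihoods are summed across respondents before computing BIC, and `n_obs` is the total number of observed actions. Information criteria for hidden Markov models are only a guide, so interpretability of the resulting states remains part of the decision.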

pith-pipeline@v0.9.0 · 5579 in / 1564 out tokens · 98010 ms · 2026-05-10T07:12:48.393775+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 1 internal anchor

[1] Anthropic (2026, February). Claude sonnet 4.6 system card. Technical report, Anthropic. Revised March 6,

[2] Arieli-Attali, M., L. Ou, and V. R. Simmering (2019). Understanding test takers’ choices in a self-adapted test: A hidden Markov modeling of process data. Frontiers in Psychology 10,

[3] Baggenstoss, P. M. (2001). A modified Baum-Welch algorithm for hidden Markov models with multiple observation spaces. IEEE Transactions on Speech and Audio Processing 9(4), 411–416. Baum, L. E., T. Petrie, G. Soules, and N. Weiss (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of...

[4] Las Vegas, NV. Chen, B., Z. Zhang, N. Langrené, and S. Zhu (2025). Unleashing the potential of prompt engineering for large language models. Patterns 6(6). Chen, G., Y. Liu, and Y. Mao (2024). Understanding the log file data from educational and psychological computer-based testing: A scoping review protocol. PLOS ONE 19(5), e0304109. Chen, L., S. Zhang...

[5] Davison, M. L. and S. G. Sireci (2000). Multidimensional Scaling. In Handbook of Applied Multivariate Statistics and Mathematical Modeling, pp. 323–352. Elsevier. Dridi, N. and M. Hadzagic (2018). Akaike and Bayesian information criteria for hidden Markov models. IEEE Signal Processing Letters 26(2), 302–306. Eddy, S. R. (1996). Hidden Markov models. Curre...

[6] Goldhammer, F., J. Naumann, H. Rölke, A. Stelter, and K. Tóth (2017). Relating Product Data to Process Data From Computer-based Competency Assessment. In Competence Assessment in Education: Research, Models and Instruments, pp. 407–425. Springer. Goldhammer, F., J. Naumann, A. Stelter, K. Toth, H. Rölke, E. Klieme, and S. A. ZIB (2014). The time on t...

[7] Kroehne, U. and F. Goldhammer (2018). How to conceptualize, represent, and analyze log data from technology-based assessments? a generic framework and an application to questionnaire items. Behaviormetrika 45(2), 527–563. Kruskal, J. B. and M. Wish (1978). Multidimensional Scaling. Quantitative Applications in the Social Sciences. Sage. Kupiainen, S., M...

[8] Li, Z., J. Shin, H. Kuang, and A. C. Huggins-Manley (2025). Exploring the evidence to interpret differential item functioning via response process data. Educational and Psychological Measurement 85(4), 783–813. Liu, A., A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. (2025). Deepseek-v3.2: Pushing the frontier of open la...

[9] Mitchell, K. J., L. R. Jones, and J. W. Pellegrino (1999). Grading the Nation’s Report Card: Evaluating NAEP and Transforming the Assessment of Educational Progress. National Academies Press. OECD (2012). Literacy, Numeracy and Problem Solving in Technology-Rich Environments: Framework for the OECD Survey of Adult Skills. OECD Publishing. OECD (2013). OECD...
