Context-Aware Prediction of Student Quiz Performance with Multimodal Textbook Features

Samin Khan

arxiv: 2606.24770 · v1 · pith:ETDGFJUBnew · submitted 2026-05-28 · cs.CY · cs.AI · cs.LG


Pith reviewed 2026-06-29 00:58 UTC · model grok-4.3

classification cs.CY · cs.AI · cs.LG

keywords student performance prediction · multimodal features · educational data mining · quiz prediction · content features · textbook visuals · CourseKata


The pith

Text and image features from textbook chapters improve quiz score prediction by 9.1% over prior performance alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether lightweight content features from chapter review questions and textbook visuals can predict end-of-chapter quiz scores better than a student's average prior exercise performance. It combines 2023 CourseKata student response data with text features from review-question wording and image features from textbook visuals. Across 4,742 student-chapter observations from 562 class-student IDs, adding these content features yields a 9.1% relative improvement in student-grouped five-fold cross-validation performance. Text features reduce error in leave-chapter-out validation, while models with image features show higher error than the baseline. The results indicate that assessment content supplies independent signal about student outcomes beyond past performance.
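As a minimal sketch of what student-grouped five-fold cross-validation means here, the Python below deals student IDs into five folds so that all of a student's chapter rows fall on the same side of every split. The function name `grouped_kfold`, the toy student IDs, and the round-robin fold assignment are illustrative assumptions, not the paper's code; a library routine such as scikit-learn's `GroupKFold` serves the same purpose.

```python
import random


def grouped_kfold(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    # Deal the shuffled group IDs round-robin into folds.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test


# Hypothetical toy data: 20 students, three chapter rows each.
groups = [f"s{sid}" for sid in range(20) for _ in range(3)]
splits = list(grouped_kfold(groups))

for train, test in splits:
    train_students = {groups[i] for i in train}
    test_students = {groups[i] for i in test}
    # A student never appears in both the training and the test side.
    assert not (train_students & test_students)
```

Grouping by student matters because the baseline feature is itself a per-student average; splitting rows at random would let the same student inform both training and evaluation.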

Core claim

The study reports that multimodal features extracted from review-question text and textbook images enhance prediction of quiz performance, delivering a 9.1% relative gain in student-grouped five-fold cross-validation performance over a baseline using only average prior exercise performance, across 4,742 observations from 562 class-student IDs. In leave-chapter-out validation, text features lower prediction error relative to the baseline while image-containing models raise it.
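For readers unfamiliar with the term, a relative gain is usually computed as the baseline error minus the model error, divided by the baseline error. The sketch below assumes an error-style metric where lower is better; the summary above does not state which metric the paper uses, and the two numbers are invented purely to show the arithmetic.

```python
def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Fractional error reduction of a model relative to a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Hypothetical error values chosen only to illustrate the arithmetic;
# they are NOT taken from the paper.
baseline_rmse = 0.220
augmented_rmse = 0.200
gain = relative_improvement(baseline_rmse, augmented_rmse)
print(f"relative improvement: {gain:.1%}")
```

A relative figure says nothing about the absolute size of the error, which is one reason the referee report asks for the underlying metric values.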

What carries the argument

Chapter-level multimodal content features consisting of text features from review-question wording and image features from textbook visuals, used to augment a prior-performance baseline for predicting end-of-chapter quiz scores.

Load-bearing premise

The chapter-level text and image features extracted from review-question wording and textbook visuals supply predictive signal that is not already contained in a student's average prior exercise performance.

What would settle it

A replication on a separate dataset or platform where adding the same text and image features produces no improvement or increases error in comparable student-grouped five-fold cross-validation would falsify the claim of useful additional signal.

Figures

Figures reproduced from arXiv: 2606.24770 by Samin Khan.

**Figure 2.** Standardized Ridge coefficients for the full model.
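One common way to obtain standardized ridge coefficients like those named in the caption is to z-score each feature before fitting, so that each coefficient reads as the change in predicted score per one standard deviation of that feature. The sketch below is an assumption-laden illustration using the closed-form ridge solution in NumPy: the feature names `word_count` and `edge_density`, the synthetic data, and `alpha=1.0` are hypothetical, and the page does not say how the paper standardized its inputs.

```python
import numpy as np


def standardized_ridge_coefs(X, y, alpha=1.0):
    """Ridge coefficients on z-scored features with a centered target."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column
    yc = y - y.mean()                         # centering removes the intercept
    p = Z.shape[1]
    # Closed form: (Z'Z + alpha * I)^-1 Z'y
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(p), Z.T @ yc)


# Synthetic data; every variable and effect size here is made up.
rng = np.random.default_rng(0)
n = 200
prior_avg = rng.uniform(0.3, 1.0, n)
word_count = rng.normal(40.0, 10.0, n)    # hypothetical text feature
edge_density = rng.normal(0.2, 0.05, n)   # hypothetical image feature, no true effect
quiz = 0.8 * prior_avg - 0.004 * word_count + rng.normal(0.0, 0.05, n)

X = np.column_stack([prior_avg, word_count, edge_density])
coefs = standardized_ridge_coefs(X, quiz, alpha=1.0)
for name, c in zip(["prior_avg", "word_count", "edge_density"], coefs):
    print(f"{name:>12s}: {c:+.4f}")
```

Because the features share a scale after z-scoring, coefficient magnitudes can be compared across text, image, and prior-performance inputs, which is presumably the point of the figure.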

Original abstract

Educational platforms often predict student performance from prior interactions, but the assessment content itself also varies in linguistic and visual complexity. This paper studies whether lightweight content features extracted from CourseKata chapter-review questions improve prediction of end-of-chapter quiz scores beyond a student's average prior exercise performance. The study combines 2023 CourseKata student response data with chapter-level text features from review-question wording and image features from textbook visuals. Across 4,742 student-chapter observations from 562 class-student IDs, adding content features improves student-grouped five-fold quiz prediction performance by 9.1% relative to a prior-performance baseline. In leave-chapter-out validation, text features reduce prediction error relative to the baseline, while image-containing models have higher error. This paper suggests that a context-aware model adds useful signal about the text and visual features of questions to better predict student quiz performance compared with using past student performance alone.
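The leave-chapter-out validation mentioned in the abstract can be sketched as follows: hold out every row of one chapter, fit on the remaining chapters, and score on the held-out chapter. The toy rows, the per-chapter difficulty shift, and the one-variable least-squares baseline are all invented for illustration and are not the paper's data or model.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def leave_chapter_out_rmse(rows):
    """RMSE of a prior-performance-only model on each held-out chapter."""
    result = {}
    for held in sorted({r["chapter"] for r in rows}):
        train = [r for r in rows if r["chapter"] != held]
        test = [r for r in rows if r["chapter"] == held]
        slope, icpt = fit_line([r["prior_avg"] for r in train],
                               [r["quiz"] for r in train])
        sq = [(r["quiz"] - (slope * r["prior_avg"] + icpt)) ** 2 for r in test]
        result[held] = math.sqrt(sum(sq) / len(sq))
    return result


# Toy data: six students, four chapters; later chapters are assumed harder
# by a fixed 0.05 per chapter (an invented effect).
rows = [
    {"student": s, "chapter": c,
     "prior_avg": 0.4 + 0.1 * s,
     "quiz": 0.4 + 0.1 * s - 0.05 * c}
    for c in (1, 2, 3, 4)
    for s in range(6)
]
rmse_by_chapter = leave_chapter_out_rmse(rows)
```

In this toy setup the prior-only model misses most on the first and last chapters, because nothing in a student's history encodes how a new chapter differs. That is the gap chapter-level content features are meant to fill, and also the setting in which poorly generalizing features can add error.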

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Text features add some signal for quiz prediction but image features increase error in leave-chapter-out validation, so the multimodal claim does not hold up.


The paper's headline result is that adding content features improves student-grouped five-fold quiz prediction by 9.1% relative to a prior-performance baseline on CourseKata data. In the leave-chapter-out split, text features from review-question wording reduce error, while models that include image features from textbook visuals raise it. This differential is the clearest takeaway.

The work uses 4,742 student-chapter observations and student-grouped five-fold cross-validation, a sensible choice because it keeps each student's rows within a single fold and so reduces leakage between training and test data. Showing that content features can supply independent signal beyond past performance, at least for text, is a straightforward extension worth documenting.

The soft spot is the gap between the multimodal framing and the actual numbers. The abstract presents the 9.1% as a multimodal gain, yet the validation shows images hurt performance while text helps. Because chapters are already held out, the image degradation cannot be dismissed as simple overfitting. The paper would read better if it either dropped the multimodal language or explained the image failure. There are also no model details, feature definitions, error bars, or statistical tests, which leaves the reported improvement hard to assess.

This is for researchers in learning analytics who work with textbook platform data. A reader building similar predictors might find the text-versus-image split useful as a caution. The concrete dataset and validation design are enough to justify peer review, though the multimodal interpretation needs tightening and more methods transparency would help.

Referee Report

2 major / 1 minor

Summary. The paper claims that lightweight multimodal content features (text from review-question wording and images from textbook visuals) extracted from CourseKata chapters improve prediction of end-of-chapter quiz scores beyond a student's average prior exercise performance. Across 4,742 student-chapter observations from 562 class-student IDs, it reports a 9.1% relative improvement in student-grouped five-fold cross-validation when adding these features to the baseline; text features reduce error in leave-chapter-out validation while image-containing models increase error. The paper concludes that context-aware models add useful signal from text and visual features.

Significance. If the central result holds after clarification, the work could contribute to educational prediction by showing that chapter-level content features supply signal independent of prior performance. The differential text vs. image results and the absence of model details, error bars, or tests in the abstract limit the assessed significance, as the multimodal aspect central to the title and conclusion appears unsupported by the reported leave-chapter-out outcomes.

major comments (2)

[Abstract] The claim that multimodal (text + image) content features improve quiz prediction is not supported by the reported leave-chapter-out validation results, where image-containing models have higher error than the prior-performance baseline while text-only reduces error. This indicates the 9.1% relative gain in five-fold CV is likely attributable to text features alone, directly challenging whether the multimodal construction supplies independent predictive value.
[Abstract] The central claim of a 9.1% relative improvement lacks supporting details on the prediction model, exact feature definitions, error bars, statistical tests, or how the multimodal features are combined, preventing verification that the content features are not already contained in the prior-performance baseline.

minor comments (1)

[Abstract] The abstract should explicitly distinguish the five-fold CV result from the leave-chapter-out results and qualify the multimodal conclusion accordingly.
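The error bars requested in the report could, for example, be derived from fold-level scores. The sketch below summarizes per-fold errors for a baseline and an augmented model and reports the mean paired difference with a rough standard error. The fold errors are hypothetical placeholders, not values from the paper, and `summarize_folds` is an illustrative name.

```python
import math
import statistics


def summarize_folds(baseline, augmented):
    """Mean and spread per model, plus the paired per-fold difference."""
    diffs = [b - a for b, a in zip(baseline, augmented)]
    return {
        "baseline_mean": statistics.mean(baseline),
        "baseline_sd": statistics.stdev(baseline),
        "augmented_mean": statistics.mean(augmented),
        "augmented_sd": statistics.stdev(augmented),
        # Positive mean_diff means the augmented model has lower error.
        "mean_diff": statistics.mean(diffs),
        "diff_se": statistics.stdev(diffs) / math.sqrt(len(diffs)),
    }


# Hypothetical per-fold errors for five folds; NOT results from the paper.
baseline_folds = [0.22, 0.23, 0.21, 0.22, 0.24]
augmented_folds = [0.20, 0.21, 0.20, 0.20, 0.22]

summary = summarize_folds(baseline_folds, augmented_folds)
for key, value in summary.items():
    print(f"{key:>15s}: {value:.4f}")
```

Fold scores in cross-validation share training data, so a standard error computed this way tends to understate the true uncertainty; it is a starting point rather than a formal test.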

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the abstract. We address each major comment below and agree that revisions are needed to better align the abstract with the reported results and to include key clarifying details.


Referee: [Abstract] The claim that multimodal (text + image) content features improve quiz prediction is not supported by the reported leave-chapter-out validation results, where image-containing models have higher error than the prior-performance baseline while text-only reduces error. This indicates the 9.1% relative gain in five-fold CV is likely attributable to text features alone, directly challenging whether the multimodal construction supplies independent predictive value.

Authors: We agree that the leave-chapter-out results show image-containing models increase error relative to the baseline while text features reduce it, indicating the 9.1% gain in student-grouped five-fold CV is driven by text. We will revise the abstract to state that the improvement is attributable to text features from review questions, with image features not contributing additional value in the leave-chapter-out setting. This will clarify that the multimodal title and conclusion refer to the feature extraction approach rather than a joint performance gain from both modalities.

Revision: yes

Referee: [Abstract] The central claim of a 9.1% relative improvement lacks supporting details on the prediction model, exact feature definitions, error bars, statistical tests, or how the multimodal features are combined, preventing verification that the content features are not already contained in the prior-performance baseline.

Authors: The full manuscript details the model (ridge regression on student history plus chapter features), exact text (TF-IDF and embeddings from question wording) and image (ResNet features from textbook visuals) definitions, and cross-validation metrics in the Methods and Results sections. We will revise the abstract to briefly note the model, that content features are extracted independently from chapter review questions and visuals (distinct from student prior performance), and that gains are assessed via grouped cross-validation. Error bars and formal tests are reported in the full results but cannot be fully enumerated in the abstract due to length constraints; we will add a clause on statistical assessment via cross-validation folds.

Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical ML evaluation stands on independent cross-validation


The manuscript reports an empirical comparison of quiz-score predictors: a baseline using each student's average prior exercise performance versus models that additionally ingest chapter-level text and image features. No equations, ansatzes, uniqueness theorems, or self-citations appear in the abstract or description that would make the reported 9.1% relative improvement equivalent to the input features by construction. Performance is assessed via student-grouped five-fold cross-validation and leave-chapter-out validation on held-out observations, so the result is empirically falsifiable rather than tautological. The observation that image features increase error bears on correctness, not on circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper is an empirical machine-learning study; its central claim rests on the assumption that the extracted content features are independent of the baseline prior-performance variable and that the chosen validation splits adequately control for student and chapter effects. No new physical entities are postulated.

free parameters (1)

prediction model coefficients
Any regression or ML model combining prior performance with content features will have coefficients fitted to the 4,742 observations.

axioms (1)

Domain assumption: Student quiz performance is predictable from prior exercise averages plus lightweight text and image features of the assessment content.
This premise is required for the comparison of baseline versus augmented models to be meaningful.

pith-pipeline@v0.9.1-grok · 5679 in / 1362 out tokens · 43578 ms · 2026-06-29T00:58:17.743853+00:00 · methodology


Reference graph

Works this paper leans on

11 extracted references · 6 canonical work pages

[1] Samah Alkhuzaey, Floriana Grasso, Terry R. Payne, and Valentina Tamma. 2023. Text-Based Question Difficulty Prediction: A Systematic Review of Automatic Approaches. International Journal of Artificial Intelligence in Education 34, 3 (2023), 862–914. doi:10.1007/s40593-023-00362-1

[2] Gary Bradski. 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools 25 (2000), 120–125.

[3] Albert T. Corbett and John R. Anderson. 1995. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction 4, 4 (1995), 253–278. doi:10.1007/BF01099821

[4] Qi Liu, Zhenya Huang, Yu Yin, Enhong Chen, Hui Xiong, Yu Su, and Guoping Hu. 2021. EKT: Exercise-Aware Knowledge Tracing for Student Performance Prediction. IEEE Transactions on Knowledge and Data Engineering 33, 1 (2021), 100–115. doi:10.1109/TKDE.2019.2924374

[5] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein. 2015. Deep Knowledge Tracing. In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc., Red Hook, NY, USA, 505–513. https://papers.nips.cc/paper/5654-deep-knowledge-tracing

[6] Ji Y. Son, Adam B. Blake, Laura Fries, and James W. Stigler. 2021. Modeling First: Applying Learning Science to the Teaching of Introductory Statistics. Journal of Statistics and Data Science Education 29, 1 (2021), 4–21. doi:10.1080/10691898.2020.1844106

[7] Ji Y. Son and James W. Stigler. 2017–2026. Statistics and Data Science: A Modeling Approach. CourseKata, Los Angeles. https://coursekata.org/preview/default/program Currently available in 7 versions.

[8] Robyn Speer. 2022. rspeer/wordfreq: v3.0. https://zenodo.org/records/7199437. doi:10.5281/zenodo.7199437

[9] Lubomír Štěpánek, Jana Dlouhá, and Patrícia Martinková. 2023. Item Difficulty Prediction Using Item Text Features: Comparison of Predictive Performance across Machine-Learning Algorithms. Mathematics 11, 19 (2023), 4104. doi:10.3390/math11194104

[10] Sijie Wang, Lin Ni, Zeyu Zhang, Xiaoxuan Li, Xianda Zheng, and Jiamou Liu.

[11] Multimodal Prediction of Student Performance: A Fusion of Signed Graph Neural Networks and Large Language Models. Pattern Recognition Letters 181 (2024), 1–8. doi:10.1016/j.patrec.2024.03.007
