MCQ Difficulty Prediction via Modeling Learner Heterogeneity Using Data-Driven Cognitive Profiling

Dhriti Krishnan; Jaromir Savelka

arxiv: 2605.16290 · v1 · pith:TYK7OZGOnew · submitted 2026-04-13 · 💻 cs.CY · cs.AI · cs.LG

Pith reviewed 2026-05-21 01:02 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.LG

keywords MCQ difficulty · learner heterogeneity · latent class analysis · cognitive profiling · item response theory · large language models · persona simulation

The pith

Modeling learner behavioral personas from real data improves MCQ difficulty predictions over uniform-ability assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing the assumption of a single average student ability with distinct behavioral groups discovered from real data. These groups, found through latent class analysis on the EEDI dataset, are used to prompt an LLM to simulate how different types of students would answer each question. The resulting response patterns are combined with topic information and fed into a regression model to estimate the IRT difficulty parameter for each item. This method achieves better accuracy than prior approaches that do not model heterogeneity, with cross-validated MSE dropping from 0.367 to 0.274 and R-squared rising from 0.525 to 0.686. The personas also provide interpretable explanations for why specific questions prove difficult for certain learners.

Core claim

By identifying behavioral personas via latent class analysis on student interactions and conditioning an LLM to simulate their responses, the framework generates signals that improve prediction of IRT difficulty parameters when aggregated with topic context in a Ridge Regression model.

What carries the argument

The persona-driven framework that discovers behavioral personas through latent class analysis on the EEDI dataset and conditions LLMs to produce simulated response distributions for each persona.
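
To make the aggregation step concrete, here is a minimal Python sketch of how per-persona answer distributions for one item might be turned into regression features. The feature set (per-persona probability of the keyed answer, per-persona entropy, and a class-weighted accuracy), the function name `aggregate_persona_features`, the "guesser" persona, and all numbers are illustrative assumptions; the summary above does not specify which features the paper actually computes.

```python
import math

def aggregate_persona_features(persona_dists, correct_option, class_weights):
    """Flatten per-persona answer distributions for one item into features.

    persona_dists: {persona: {option: probability}} (hypothetical LLM output)
    correct_option: the keyed answer, e.g. "B"
    class_weights: {persona: share of students in that latent class}
    """
    features = {}
    weighted_correct = 0.0
    for name, dist in persona_dists.items():
        p_correct = dist.get(correct_option, 0.0)
        # Shannon entropy in bits: high when the persona spreads across options.
        entropy = -sum(p * math.log2(p) for p in dist.values() if p > 0)
        features[f"{name}_p_correct"] = p_correct
        features[f"{name}_entropy"] = entropy
        weighted_correct += class_weights[name] * p_correct
    # Expected accuracy of the whole population, mixing personas by class size.
    features["mix_p_correct"] = weighted_correct
    return features

# Made-up distributions for two personas on a four-option item keyed "B".
persona_dists = {
    "conceptual_reasoner": {"A": 0.1, "B": 0.8, "C": 0.05, "D": 0.05},
    "guesser": {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25},
}
class_weights = {"conceptual_reasoner": 0.6, "guesser": 0.4}
feats = aggregate_persona_features(persona_dists, "B", class_weights)
```

In a pipeline of this shape, one such feature dictionary per item, joined with topic features, would form a row of the design matrix passed to the regression.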

If this is right

More accurate difficulty estimates for MCQs, as shown by reduced mean squared error in five-fold cross-validation.
Interpretable insights into student misconceptions that explain item difficulty.
Potential applications in designing better diagnostic assessments.
Data-driven cognitive profiling as a replacement for theoretical ability sampling in difficulty modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could extend to predicting difficulty in other question formats by applying similar persona simulation.
Personas discovered this way might be used to create adaptive testing systems that account for different learner profiles.
Future work could test if these personas transfer across different educational domains or datasets.

Load-bearing premise

The behavioral personas identified from the EEDI dataset produce simulated responses that match the actual heterogeneous behaviors and misconceptions of real students.

What would settle it

Collect new student response data on the same MCQs and compare the actual distribution of answers per persona against the LLM-simulated distributions to check for alignment in error rates and patterns.
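
One way to score that alignment is the KL divergence between the observed and simulated answer distributions for a persona on a given item. The sketch below assumes both distributions are available as option-to-probability dictionaries; the smoothing constant `eps` and the example numbers are illustrative choices, not values from the paper.

```python
import math

def kl_divergence(observed, simulated, eps=1e-9):
    """KL(observed || simulated) in nats over a shared set of answer options."""
    options = set(observed) | set(simulated)

    def smooth(dist):
        # Add eps and renormalize so a zero probability keeps the result finite.
        raw = {o: dist.get(o, 0.0) + eps for o in options}
        total = sum(raw.values())
        return {o: v / total for o, v in raw.items()}

    p, q = smooth(observed), smooth(simulated)
    return sum(p[o] * math.log(p[o] / q[o]) for o in options)

# Made-up example: real students in a class favor A more than the simulation does.
observed = {"A": 0.7, "B": 0.3}
simulated = {"A": 0.5, "B": 0.5}
score = kl_divergence(observed, simulated)
```

A value near zero means the simulated persona answers like the real latent class on that item; averaging the score over items and personas would give a single fidelity number.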

Figures

Figures reproduced from arXiv: 2605.16290 by Dhriti Krishnan, Jaromir Savelka.

**Figure 1.** The Proposed Pipeline. LCA discovers 5 learner personas from student response data (Step 1). An LLM simulates per-persona response distributions (Step 2), which are aggregated into features (Step 3) and used to predict IRT difficulty via Ridge Regression (Step 4). Ground truth is estimated independently via 2PL-IRT (Step 0).

**Figure 2.** Model selection for psychometric profiling.

**Figure 3.** LLM-Generated Description for “The Conceptual Reasoner”.

read the original abstract

Predicting the difficulty of multiple-choice questions (MCQs) is important for effective assessment, yet current methods typically assume a unimodal student ability distribution, overlooking the heterogeneous nature of student misconceptions. We propose a persona-driven framework that replaces theoretical ability sampling with data-driven cognitive profiling. Using student interactions from the EEDI dataset, we identify behavioral personas via latent class analysis (LCA), then condition a large language model (LLM) to simulate response distributions for each persona. These signals are aggregated with topic context and fed into a Ridge Regression model to predict the item response theory (IRT) difficulty parameter. With five-fold cross-validation, our method improves over a recent baseline (MSE: 0.367 to 0.274; R2: 0.525 to 0.686). The discovered personas are interpretable and offer insights into why items are difficult, with potential applications to diagnostic assessment design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gets a measurable lift in IRT difficulty prediction by layering LCA personas and LLM simulations on top of topic features, but the gain rests on unverified assumptions about how well the simulations match real student behaviors.

The main thing to know is that this paper gets a solid bump in MCQ difficulty prediction by finding behavioral personas with latent class analysis on the EEDI data, simulating their responses via LLM, and feeding that into a regression for IRT parameters. The reported cross-validation gains look real enough on the surface. What is new is the full pipeline that ties data-driven profiling to LLM simulation for this specific task. It does well in delivering measurable improvement over the baseline and in pointing out that the personas can be interpreted to explain item difficulty. The soft spot is the missing direct test of whether the LLM outputs actually reproduce the real response distributions for those personas. The improvement could stem from other parts of the model if the simulations are not faithful. Details on the cross-validation setup would also help confirm no data leakage between persona discovery and the target values. This paper is for researchers and practitioners in educational assessment who want to move past uniform ability assumptions in test design. It offers a workable method for incorporating heterogeneity. I would send it for peer review. The concrete numbers and the novel mix of techniques make it worth a closer look by referees who can push on the validation gaps.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a persona-driven framework for predicting the difficulty of multiple-choice questions. It applies latent class analysis (LCA) to student interaction data from the EEDI dataset to discover behavioral personas, conditions an LLM on these personas to simulate response distributions, aggregates the resulting signals with topic context, and uses Ridge regression to predict item response theory (IRT) difficulty parameters. Five-fold cross-validation is reported to yield improvements over a recent baseline (MSE reduced from 0.367 to 0.274; R² increased from 0.525 to 0.686). The personas are described as interpretable and useful for understanding item difficulty.
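
For reference, the two reported metrics have the standard definitions below, written here with $b_i$ for the IRT difficulty of item $i$, $\hat{b}_i$ for its prediction, and $\bar{b}$ for the mean difficulty over the $n$ evaluated items. The notation is an assumption of this note, not taken from the manuscript.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( b_i - \hat{b}_i \right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( b_i - \hat{b}_i \right)^2}{\sum_{i=1}^{n} \left( b_i - \bar{b} \right)^2}
```

Under these definitions a lower MSE and a higher R² both indicate that predictions sit closer to the estimated difficulties, which is the direction of the reported change.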

Significance. If the LLM-simulated responses conditioned on the LCA personas accurately reproduce the heterogeneous misconception patterns present in the real data, the framework would provide a concrete advance over unimodal ability assumptions in difficulty prediction. The reported numerical gains and the interpretability of the personas would support applications in diagnostic assessment design. The use of data-driven profiling rather than theoretical sampling is a positive methodological direction.

major comments (2)

[Method (LCA + LLM simulation pipeline)] The central performance claim (MSE 0.367→0.274, R² 0.525→0.686) depends on the fidelity of the LLM-generated response distributions to the actual per-persona response patterns in the EEDI data. No section reports a direct validation metric such as KL divergence between simulated and observed response distributions, per-item match rates on held-out students from the same latent classes, or calibration plots. Without this check, it remains possible that the regression improvement arises from generic LLM cues or topic features rather than genuine modeling of learner heterogeneity.
[Experimental setup and cross-validation procedure] The five-fold cross-validation description does not explicitly state that persona discovery via LCA was performed strictly inside each training fold. Because the target IRT difficulty parameters are estimated from the same student interaction data used for persona identification, performing LCA on the full dataset before splitting would create leakage and inflate the reported gains.
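
The leakage concern in the second comment can be illustrated with a small Python sketch in which the profiling step is refit on each training split and never sees held-out items. The callables `fit_profiles`, `fit_model`, and `predict` are hypothetical placeholders, and the sketch assumes folds are taken over items; the manuscript's actual split unit and implementation are not described here.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle item indices and split them into k near-equal folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(items, fit_profiles, fit_model, predict, k=5):
    """Run k-fold CV with profiling refit on the training split of each fold."""
    predictions = {}
    for held_out in k_fold_indices(len(items), k):
        held = set(held_out)
        train = [items[i] for i in range(len(items)) if i not in held]
        profiles = fit_profiles(train)       # fitted without held-out items
        model = fit_model(train, profiles)
        for i in held_out:
            predictions[i] = predict(model, profiles, items[i])
    return predictions

# Toy stand-ins: the "profile" is just the mean training difficulty.
items = [{"id": i, "b": float(i)} for i in range(10)]
preds = cross_validate(
    items,
    fit_profiles=lambda train: sum(t["b"] for t in train) / len(train),
    fit_model=lambda train, profiles: profiles,
    predict=lambda model, profiles, item: model,
)
```

The point of the structure is that anything learned from data, including the personas, lives inside the loop; fitting it once on the full dataset before the loop is the pattern the referee warns about.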

minor comments (2)

[LCA implementation details] Clarify the exact number of latent classes retained, the model-selection criterion (e.g., BIC, interpretability), and how the persona descriptions were converted into LLM prompts.
[Results and baseline comparison] The abstract and results section should state whether the baseline method was re-implemented and evaluated under identical five-fold splits and feature conditions.
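
As an illustration of the model-selection criterion the first minor comment asks about, the Python sketch below picks a number of latent classes by BIC. The parameter count assumes an LCA with binary (correct/incorrect) indicators, and the log-likelihoods, item count, and student count are made up for the example; none of these values come from the paper.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def lca_param_count(n_classes, n_items):
    """Free parameters of an LCA with binary indicators:
    (n_classes - 1) class weights plus one response probability per class and item."""
    return (n_classes - 1) + n_classes * n_items

# Made-up maximized log-likelihoods for candidate class counts.
candidates = {3: -6000.0, 4: -5800.0, 5: -5750.0, 6: -5740.0}
n_items, n_students = 20, 1000
scores = {
    k: bic(ll, lca_param_count(k, n_items), n_students)
    for k, ll in candidates.items()
}
best_k = min(scores, key=scores.get)
```

With these toy numbers the penalty term outweighs the small likelihood gains beyond four classes, which is the trade-off a BIC-based choice would need to report alongside any interpretability argument.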

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of validation and experimental rigor that we address point by point below. We commit to revisions that will strengthen the presentation and evidence without altering the core contributions.

Referee: [Method (LCA + LLM simulation pipeline)] The central performance claim (MSE 0.367→0.274, R² 0.525→0.686) depends on the fidelity of the LLM-generated response distributions to the actual per-persona response patterns in the EEDI data. No section reports a direct validation metric such as KL divergence between simulated and observed response distributions, per-item match rates on held-out students from the same latent classes, or calibration plots. Without this check, it remains possible that the regression improvement arises from generic LLM cues or topic features rather than genuine modeling of learner heterogeneity.

Authors: We agree that explicit validation of the LLM simulation fidelity is necessary to substantiate that gains arise from modeling learner heterogeneity. The original manuscript did not report quantitative metrics such as KL divergence or calibration plots comparing simulated and observed per-persona response distributions. In the revised manuscript we will add a dedicated subsection with these metrics (including average KL divergence across personas and calibration plots on held-out students), along with an ablation replacing persona-conditioned simulations with generic or random conditioning to isolate the contribution of the data-driven profiles. revision: yes
Referee: [Experimental setup and cross-validation procedure] The five-fold cross-validation description does not explicitly state that persona discovery via LCA was performed strictly inside each training fold. Because the target IRT difficulty parameters are estimated from the same student interaction data used for persona identification, performing LCA on the full dataset before splitting would create leakage and inflate the reported gains.

Authors: We confirm that persona discovery via LCA was performed independently within each training fold using only the training portion of the data for that fold. The IRT difficulty parameters serve as the prediction target and are estimated from the full dataset in the standard manner for such tasks; they are not used as features in the LCA step. We acknowledge that the manuscript text did not explicitly describe the fold-wise execution of LCA. In the revision we will update the experimental setup and cross-validation sections to state this procedure clearly and unambiguously. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under cross-validation

The paper identifies behavioral personas via latent class analysis on the EEDI dataset, simulates responses with an LLM conditioned on those personas, aggregates signals with topic context, and feeds them into Ridge regression to predict IRT difficulty parameters, reporting improvement under five-fold cross-validation. No quoted equations, definitions, or steps in the provided text reduce the target IRT difficulty prediction to a fitted parameter or self-citation by construction; the personas and simulations are derived from student interactions but the evaluation uses held-out folds against an external baseline, keeping the central empirical claim independently verifiable rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on several modeling choices and assumptions whose values or validity are not independently verified in the abstract; these include the number of latent classes, the fidelity of LLM simulations, and standard IRT assumptions.

free parameters (2)

Number of latent classes
Chosen during LCA to define behavioral personas from the EEDI interaction data.
Ridge regression alpha
Regularization strength fitted or selected for the final difficulty prediction model.
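
For context on the second free parameter, the standard ridge objective is shown below, with $x_i$ the feature vector of item $i$, $b_i$ its IRT difficulty, $w$ the weights, and $\alpha$ the regularization strength. The symbols are chosen for this note, and the intercept is omitted for brevity.

```latex
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \left( b_i - x_i^{\top} w \right)^2 + \alpha \lVert w \rVert_2^2
\quad\Longrightarrow\quad
\hat{w} = \left( X^{\top} X + \alpha I \right)^{-1} X^{\top} b
```

Larger $\alpha$ shrinks the weights toward zero, so the reported metrics depend on how $\alpha$ was chosen and whether that choice was made inside the cross-validation folds.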

axioms (2)

domain assumption: Item response theory assumptions hold for the target difficulty parameter.
The prediction target is defined as the IRT difficulty parameter.
ad hoc to paper: Conditioning an LLM on persona descriptions yields response distributions that match real student heterogeneity.
This is the core mechanism for generating additional signals beyond topic context.
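
Regarding the first axiom, Figure 1 names a 2PL-IRT model as the source of the ground-truth difficulties. In its standard form, the probability that student $j$ with ability $\theta_j$ answers item $i$ correctly is given below, where $a_i$ is the item's discrimination and $b_i$ its difficulty; the notation is the conventional one and is assumed here rather than quoted from the paper.

```latex
P\left( X_{ij} = 1 \mid \theta_j \right) = \frac{1}{1 + \exp\left( -a_i \left( \theta_j - b_i \right) \right)}
```

The difficulty $b_i$ is the ability level at which a student has a 50% chance of answering correctly, and it is this parameter that the regression is trained to predict.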

invented entities (1)

Behavioral personas (no independent evidence)
purpose: To capture distinct patterns of student misconceptions and response behavior.
Discovered via LCA but treated as new constructs for conditioning the LLM.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
persona-driven framework that replaces theoretical ability sampling with data-driven cognitive profiling... latent class analysis (LCA)... condition a large language model (LLM) to simulate response distributions... Ridge Regression model to predict the item response theory (IRT) difficulty parameter

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Jcost-shaped structural theorems... recognition cost function J(x) = ½(x + x⁻¹) − 1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

[1] Anthropic: Claude 3.7 sonnet (2025)
[2] Anthropic: Introducing claude opus 4.5 (2025)
[3] Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
[4] Binz, M., Schulz, E.: Turning large language models into cognitive models. In: The Twelfth International Conference on Learning Representations (2024)
[5] Butler, A.C.: Multiple-choice testing in education: Are the best practices for assessment also good for learning? Journal of Applied Research in Memory and Cognition 7(3), 323–331 (2018)
[6] De Ayala, R.J.: The Theory and Practice of Item Response Theory. Guilford Publications (2013)
[7] Feng, W., Lee, J., McNichols, H., Scarlatos, A., Smith, D., Woodhead, S., Ornelas, N., Lan, A.: Exploring automated distractor generation for math multiple-choice questions via large language models. In: Findings of NAACL 2024. pp. 3067–3082 (2024)
[8] Feng, W., Tran, P., Sireci, S., Lan, A.S.: Reasoning and sampling-augmented mcq difficulty prediction via llms. In: Artificial Intelligence in Education (AIED 2025), LNCS. pp. 31–45. Springer (2025)
[9] Ha, L.A., Yaneva, V., Baldwin, P., Mee, J.: Predicting the difficulty of multiple choice questions in a high-stakes medical exam. In: Proceedings of BEA Workshop
[10] He-Yueya, J., Ma, W.A., Gandhi, K., Domingue, B.W., Brunskill, E., Goodman, N.D.: Psychometric alignment: Capturing human knowledge distributions via language models. arXiv preprint arXiv:2407.15645 (2024)
[11] Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
[12] Hu, T., Collier, N.: Quantifying the persona effect in LLM simulations. In: Proceedings of ACL 2024. pp. 10289–10307 (2024)
[13] Huang, Z., Qi, Y., Shen, C., Ding, G.: Question difficulty prediction for multiple choice problems in medical exams. In: Proceedings of CIKM (2019)
[14] Hurst, A., Lerer, A., Goucher, A.P., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)
[15] Koedinger, K.R., Corbett, A.T., Perfetti, C.: The knowledge-learning-instruction framework: Bridging the science-practice chasm. Cognitive Science 36(5), 757–798 (2012)
[16] Koedinger, K.R., Nathan, M.J.: The real story behind story problems: Effects of representations on quantitative reasoning. The Journal of the Learning Sciences 13(2), 129–164 (2004)
[17] Lalor, J.P., Wu, H., Yu, H.: Learning latent parameters without human response patterns: Item response theory with artificial crowds. In: Proceedings of EMNLP 2019 (2019)
[18] Park, J.W., Park, S.J., Won, H.S., Kim, K.M.: Large language models are students at various levels: Zero-shot question difficulty estimation. In: Findings of EMNLP
[19] pp. 8157–8177 (2024)
[20] Rasch, G.: Probabilistic Models for Some Intelligence and Attainment Tests. University of Chicago Press (1960)
[21] Scarlatos, A., Fernandez, N., Ormerod, C., Lottridge, S., Lan, A.: Smart: Simulated students aligned with item response theory for question difficulty prediction. In: Proceedings of EMNLP 2025 (2025)
[22] Wang, Z., Lamb, A., Saveliev, E., Cameron, P., Zaykov, Y., Hernández-Lobato, J.M., Turner, R.E., Baraniuk, R.G., Barton, C., Jones, S.P., Woodhead, S., Zhang, C.: Diagnostic questions: The neurips 2020 education challenge. arXiv preprint arXiv:2007.12061 (2020)
[23] Yaneva, V., North, K., Baldwin, P., Ha, L.A., Rezayi, S., Zhou, Y., Ray Choudhury, S., Harik, P., Clauser, B.: Findings from the first shared task on automated prediction of difficulty and response time for multiple-choice questions. In: Proceedings of BEA Workshop at NAACL 2024. pp. 470–482 (2024)
[24] Yang, F.M., Kao, S.T.: Item response theory for measurement validity. Shanghai Archives of Psychiatry 26(3), 171 (2014)
[25] Yuan, Z., Xiao, Y., Li, M., Xuan, W., Tong, R., Diab, M., Mitchell, T.: Towards valid student simulation with large language models (2026), arXiv:2601.05473
