Dependencies in Item-Adaptive CAT Data and Differential Item Functioning Detection: A Multilevel Framework

Chingwei David Shin; Dandan Chen Kaptur; Jinming Zhang; Justin Kern

arxiv: 2409.16534 · v2 · submitted 2024-09-25 · 📊 stat.AP

Dandan Chen Kaptur, Justin Kern, Chingwei David Shin, Jinming Zhang

Pith reviewed 2026-05-23 20:43 UTC · model grok-4.3

classification 📊 stat.AP

keywords differential item functioning · computerized adaptive testing · multilevel modeling · logistic regression · DIF detection · item bias · adaptive testing dependencies

The pith

A two-level logistic model accounts for CAT-induced dependencies to improve DIF detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-level logistic model for detecting differential item functioning in computerized adaptive testing. It argues that adaptive item selection creates systematic dependencies among responses through provisional ability estimates, which single-level models ignore. By modeling these as nuisance effects, the two-level approach reduces false positives in DIF detection. Simulations show better performance especially in short tests with low exposure rates. This matters because ignoring dependencies can lead to incorrect conclusions about item bias in adaptive tests.

Core claim

The two-level logistic model improves control of spurious DIF and maintains competitive statistical power compared to single-level models in CAT settings, particularly under conditions of shorter tests and smaller exposure rates, as demonstrated through Monte Carlo simulations.

What carries the argument

The two-level logistic model that explicitly captures nuisance effects from CAT-induced structural dependencies arising from provisional ability estimates.
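
For orientation, one conventional way to write such a model is sketched below. This is an assumption for illustration, not the paper's specification, which the abstract does not give. The fixed part follows the standard logistic regression DIF form (Swaminathan & Rogers, 1990), and the random intercept $u_j$ for a cluster $j$ (hypothetically, examinees who reached the studied item at similar provisional ability estimates) stands in for the CAT-induced nuisance effect.

```latex
\[
\begin{aligned}
\text{Level 1:}\quad & \operatorname{logit}\,\Pr(y_{ij} = 1)
  = \beta_{0j} + \beta_1 \hat\theta_{ij} + \beta_2 G_{ij} + \beta_3 \hat\theta_{ij} G_{ij}, \\
\text{Level 2:}\quad & \beta_{0j} = \gamma_{00} + u_j, \qquad u_j \sim N(0, \tau^2).
\end{aligned}
\]
```

Here $y_{ij}$ is the response of examinee $i$ in cluster $j$ to the studied item, $\hat\theta_{ij}$ is the ability estimate used for matching, and $G_{ij}$ is the group indicator. In this form $\beta_2 \neq 0$ corresponds to uniform DIF and $\beta_3 \neq 0$ to nonuniform DIF, and setting $\tau^2 = 0$ recovers a single-level logistic regression.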

If this is right

The model shows improved Type I error control for DIF detection in CAT data.
Performance advantages are more pronounced with shorter tests and smaller exposure rates.
Model convergence varies systematically across conditions, linking inferential accuracy to convergence reliability.
Multilevel modeling is promising for handling dependencies in adaptive testing DIF analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar multilevel approaches could be applied to other types of adaptive assessments beyond CAT.
Future work might explore integrating this with different ability estimators or exposure controls.
Real-world CAT systems could benefit from routine use of such models to validate item fairness.

Load-bearing premise

That the dependencies induced by adaptive item selection through provisional ability estimates can be adequately modeled as nuisance effects in a two-level logistic framework without altering the primary DIF detection.

What would settle it

A simulation or real CAT dataset where the two-level model shows higher Type I error rates or lower power than single-level models under matched conditions would falsify the claim of improved performance.

Original abstract

Differential item functioning (DIF) detection is an important yet understudied problem in computerized adaptive testing (CAT). In this article, we proposed a two-level logistic model to improve DIF detection in CAT by explicitly accounting for nuisance effects arising from CAT-induced structural dependency. First, we conceptualized that adaptive item selection induces systematic dependencies among examinees and items through provisional ability estimates, whereas traditional single-level DIF methods assume independent observations and may yield misleading results in CAT settings. Then, using a numeric example and Monte Carlo simulations, we compared our proposed two-level model with competing single-level models under various CAT conditions, manipulating test length, exposure control, ability estimator, DIF type, and DIF prevalence. Item-level Type-I error and statistical power conditional on joint model convergence were reported for each model. We showed that the proposed two-level model has improved control of spurious DIF and competitive power relative to single-level models, particularly with shorter tests and smaller exposure rates. However, we observed that the model convergence varied systematically across simulated conditions, highlighting that inferential accuracy and convergence reliability are intertwined in complex CAT DIF settings. Through this study, we underscored both the promise of multilevel DIF modeling in CAT and the need for future research to jointly evaluate convergence and inferential performance when assessing DIF models.
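
To make the dependency mechanism concrete, the sketch below shows how item selection driven by a provisional ability estimate sends examinees with similar estimates to the same item. It assumes maximum-information selection under a 3PL model, a common CAT rule that the abstract does not confirm. The function names, the scaling (no D = 1.7 constant), and the three-item pool are hypothetical.

```python
import math


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2


def select_item(theta_hat, pool, used):
    """Pick the unused item with maximum information at the provisional estimate.

    pool maps an item id to its (a, b, c) parameters; used is a set of ids
    already administered. Raises ValueError if no item is left.
    """
    candidates = [item for item in pool if item not in used]
    return max(candidates, key=lambda item: info_3pl(theta_hat, *pool[item]))


# Hypothetical pool: identical discrimination and guessing, different difficulty.
POOL = {
    "easy": (1.2, -1.0, 0.2),
    "medium": (1.2, 0.0, 0.2),
    "hard": (1.2, 1.0, 0.2),
}

if __name__ == "__main__":
    for theta_hat in (-0.9, -0.8, 1.1):
        print(theta_hat, select_item(theta_hat, POOL, used=set()))
```

In this toy pool, examinees whose provisional estimates sit near -0.9 all receive the `easy` item, so the responses observed for that item come from a narrow, non-random band of provisional ability. A single-level model would treat those responses as independent draws.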

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The two-level model improves reported Type I error control for DIF in CAT simulations but only conditional on convergence, which varies systematically by condition and model.

The main takeaway is that this paper proposes a two-level logistic model to handle dependencies created by adaptive item selection in CAT, and the simulations show it controls spurious DIF better than single-level alternatives, especially on short tests with low exposure rates. The authors treat the provisional ability estimates as inducing nuisance correlations that standard DIF methods ignore, then compare the models across test length, exposure control, ability estimator, DIF type, and prevalence. The numeric example and Monte Carlo setup are straightforward, and they report item-level Type I error and power where the joint model converges. That part is useful for anyone working on fairness in adaptive testing.

The soft spot is the conditioning on convergence. The abstract states that convergence rates differ systematically across conditions and models, and all performance metrics are given only for converged cases. If the two-level model converges less often precisely where the gains appear largest, the comparison does not cover the full operating characteristics of the procedure. The authors note the intertwining of convergence and inference at the end, which is honest, but it still limits how strongly the results support the claim of improved control. No derivation details or error bars are mentioned in the abstract, and the full text would need to show whether the multilevel specification is correctly identified and whether alternative ways to handle the dependencies were checked.

This is for methodologists in psychometrics and applied statistics who deal with CAT data. It is coherent enough on its own terms to deserve a serious referee, even with the convergence issue, because the problem is understudied and the proposed fix is specific enough to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-level logistic model for DIF detection in CAT to explicitly model nuisance dependencies induced by adaptive item selection via provisional ability estimates. Through a numeric example and Monte Carlo simulations varying test length, exposure control, ability estimator, DIF type, and prevalence, it compares the two-level model against single-level alternatives. Results (conditional on joint-model convergence) indicate improved Type-I error control for spurious DIF and competitive power, especially under shorter tests and lower exposure rates, while noting systematic variation in convergence rates across conditions.

Significance. If the conditional simulation results can be shown to hold without selection bias from differential convergence, the multilevel framework would address an important gap by providing a principled way to handle CAT-induced dependencies in DIF analysis, potentially leading to more reliable item fairness assessments in operational adaptive testing.

major comments (2)

[Abstract / Simulation results] Abstract and simulation results section: Type-I error and power are reported only conditional on joint model convergence, yet the text states that convergence 'varied systematically across simulated conditions' (test length, exposure rate, model type). This selection effect is load-bearing for the central claim of superior spurious-DIF control, because if the two-level model converges less frequently precisely in the short-test/low-exposure regimes where the largest gains are reported, the headline improvements are computed on a non-representative subset of replications and do not establish superior operating characteristics for the procedure as a whole.
[Simulation study] Simulation design: No information is given on the number of replications per cell, the exact convergence criterion, or whether failed replications were re-run with different starting values. Without these details it is impossible to judge whether the reported conditional metrics are stable or whether differential non-convergence rates distort the model comparisons.
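
The selection effect described in the first major comment can be illustrated with a toy simulation. The sketch below draws null test statistics, lets convergence fail more often when the statistic is extreme, and compares the rejection rate over all replications with the rate over converged replications only. The link between convergence and the statistic (0.95 versus 0.5 convergence probability on either side of |z| = 1.5) is invented for illustration and is not taken from the paper.

```python
import random


def simulate_conditional_bias(n_reps=200_000, seed=1):
    """Return (rate over all reps, rate over converged reps, convergence rate).

    Every replication is generated under the null, so a well-calibrated
    two-sided 5% test should reject about 5% of the time.
    """
    rng = random.Random(seed)
    rejected_all = 0
    converged = 0
    rejected_converged = 0
    for _ in range(n_reps):
        z = rng.gauss(0.0, 1.0)          # null test statistic
        reject = abs(z) > 1.96           # nominal 5% two-sided test
        # Assumed mechanism: extreme statistics converge less often.
        p_converge = 0.5 if abs(z) > 1.5 else 0.95
        ok = rng.random() < p_converge
        rejected_all += reject
        if ok:
            converged += 1
            rejected_converged += reject
    return (
        rejected_all / n_reps,
        rejected_converged / converged,
        converged / n_reps,
    )


if __name__ == "__main__":
    overall, conditional, conv_rate = simulate_conditional_bias()
    print(f"all replications:       {overall:.3f}")
    print(f"converged replications: {conditional:.3f}")
    print(f"convergence rate:       {conv_rate:.3f}")
```

Under these made-up settings the conditional rate lands at roughly 0.028 while the rate over all replications stays near the nominal 0.05, so a procedure can look conservative purely because its most extreme replications dropped out.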

minor comments (2)

[Abstract] The abstract states that 'item-level Type-I error and statistical power conditional on joint model convergence were reported,' but does not indicate whether standard errors or variability measures across replications are provided in the tables or figures.
[Methods] Notation for the two-level model (random effects for examinee-item dependencies) should be introduced with an explicit equation early in the methods section to allow readers to verify how the CAT-induced structure is parameterized.
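
The variability measure asked for in the first minor comment could be as simple as a binomial Monte Carlo standard error on each reported rate. The helper below is a minimal sketch; its name and input format (one boolean per converged replication) are assumptions, not the paper's reporting code.

```python
import math


def rejection_rate_with_se(flags):
    """Return (rate, Monte Carlo standard error) for a list of reject flags.

    The standard error uses the binomial formula sqrt(p * (1 - p) / n),
    which treats replications as independent.
    """
    n = len(flags)
    if n == 0:
        raise ValueError("need at least one replication")
    p = sum(bool(f) for f in flags) / n
    return p, math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Hypothetical cell: 5 rejections in 100 converged null replications.
    rate, se = rejection_rate_with_se([True] * 5 + [False] * 95)
    print(f"rate = {rate:.3f}, MC SE = {se:.3f}")
```

With 100 replications and an observed rate of 0.05, the standard error is about 0.022, which shows how wide the uncertainty on a single cell can be when few replications survive the convergence filter.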

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments highlighting important issues with conditional reporting and simulation transparency. We agree these points warrant revision to strengthen the manuscript and address potential selection effects.

Referee: [Abstract / Simulation results] Abstract and simulation results section: Type-I error and power are reported only conditional on joint model convergence, yet the text states that convergence 'varied systematically across simulated conditions' (test length, exposure rate, model type). This selection effect is load-bearing for the central claim of superior spurious-DIF control, because if the two-level model converges less frequently precisely in the short-test/low-exposure regimes where the largest gains are reported, the headline improvements are computed on a non-representative subset of replications and do not establish superior operating characteristics for the procedure as a whole.

Authors: We acknowledge that conditional reporting on convergence introduces a potential selection bias, and that this is particularly relevant given the systematic variation noted in the manuscript. The paper already emphasizes that inferential accuracy and convergence reliability are intertwined in CAT DIF settings. In revision we will add a dedicated discussion of this limitation, include a table or figure of convergence rates by condition (test length, exposure rate, model), and clarify the scope of the reported advantages. We will also explore adding sensitivity analyses using the existing replications where possible. revision: yes
Referee: [Simulation study] Simulation design: No information is given on the number of replications per cell, the exact convergence criterion, or whether failed replications were re-run with different starting values. Without these details it is impossible to judge whether the reported conditional metrics are stable or whether differential non-convergence rates distort the model comparisons.

Authors: We will revise the simulation design section to report the number of replications per cell, the exact convergence criterion (including software defaults, iteration limits, and tolerance thresholds), and details on handling of non-converged cases, including any re-runs with alternative starting values. These additions will improve reproducibility and allow better assessment of result stability. revision: yes
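
One way to present the promised convergence table alongside a simple sensitivity analysis is sketched below. For each condition it reports the convergence rate, the rejection rate among converged replications, and worst-case bounds on the all-replication rate obtained by counting every non-converged replication as a non-rejection (lower) or as a rejection (upper). The record layout and function name are hypothetical, and the bounding approach is a suggestion here, not something the authors commit to.

```python
from collections import defaultdict


def summarize_by_condition(records):
    """Summarize replication records per simulation condition.

    Each record is a dict with keys "condition", "converged" (bool) and
    "rejected" (bool, ignored when the replication did not converge).
    """
    by_condition = defaultdict(list)
    for rec in records:
        by_condition[rec["condition"]].append(rec)

    summary = {}
    for condition, reps in by_condition.items():
        n = len(reps)
        converged = [r for r in reps if r["converged"]]
        n_conv = len(converged)
        n_rej = sum(bool(r["rejected"]) for r in converged)
        summary[condition] = {
            "n": n,
            "convergence_rate": n_conv / n,
            # Undefined when nothing converged in this condition.
            "conditional_rate": n_rej / n_conv if n_conv else None,
            # Bounds on the rate over all replications.
            "lower_bound": n_rej / n,
            "upper_bound": (n_rej + (n - n_conv)) / n,
        }
    return summary


if __name__ == "__main__":
    # Hypothetical records for one short-test condition.
    demo = (
        [{"condition": "short", "converged": True, "rejected": True}]
        + [{"condition": "short", "converged": True, "rejected": False}] * 5
        + [{"condition": "short", "converged": False, "rejected": None}] * 4
    )
    print(summarize_by_condition(demo))
```

Wide bounds in a condition signal that the conditional rate alone cannot support a comparison between models there, which is the situation the referee is concerned about.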

Circularity Check

0 steps flagged

No circularity: model proposal and simulation results are independent of inputs

The paper proposes a two-level logistic model to account for CAT-induced dependencies and evaluates it via Monte Carlo simulations comparing Type-I error and power against single-level alternatives. No equations, fitted parameters, or predictions reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central performance claims rest on new simulation output rather than renaming or re-deriving prior results. Convergence conditioning is a reporting limitation but does not create definitional circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities are detailed beyond standard logistic model components. The approach relies on the unelaborated assumption that the two-level structure captures nuisance effects.

