
arxiv: 2409.16534 · v2 · submitted 2024-09-25 · 📊 stat.AP

Dependencies in Item-Adaptive CAT Data and Differential Item Functioning Detection: A Multilevel Framework

Pith reviewed 2026-05-23 20:43 UTC · model grok-4.3

classification 📊 stat.AP
keywords differential item functioning · computerized adaptive testing · multilevel modeling · logistic regression · DIF detection · item bias · adaptive testing dependencies

The pith

A two-level logistic model accounts for CAT-induced dependencies to improve DIF detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-level logistic model for detecting differential item functioning (DIF) in computerized adaptive testing (CAT). It argues that adaptive item selection, which chooses each item from a provisional ability estimate, creates systematic dependencies among responses that single-level models ignore. Modeling these dependencies as nuisance effects reduces false-positive DIF flags. In Monte Carlo simulations the advantage is largest for short tests with low exposure rates. This matters because ignoring the dependencies can lead to incorrect conclusions about item bias in adaptive tests.

Core claim

The two-level logistic model improves control of spurious DIF and maintains competitive statistical power compared to single-level models in CAT settings, particularly under conditions of shorter tests and smaller exposure rates, as demonstrated through Monte Carlo simulations.

What carries the argument

The argument rests on a two-level logistic model that explicitly captures the nuisance effects of CAT-induced structural dependencies, which arise from provisional ability estimates.
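
For orientation, one generic way to write a model of this kind is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $y_{ij}$ is the response of examinee $i$ in cluster $j$ to the studied item, $\theta_i$ is the matching ability estimate, $G_i$ is the group indicator, and $u_j$ is a random intercept for cluster $j$, where clusters are assumed here to be groups of examinees with similar provisional ability estimates. The paper's actual parameterization may differ.

```latex
\begin{aligned}
\text{Level 1:}\quad & \operatorname{logit} P(y_{ij}=1) = \beta_{0j} + \beta_1 \theta_{i} + \beta_2 G_{i} + \beta_3\, \theta_{i} G_{i},\\
\text{Level 2:}\quad & \beta_{0j} = \gamma_{00} + u_{j}, \qquad u_{j} \sim N(0, \tau^2).
\end{aligned}
```

In this sketch $\beta_2 \neq 0$ corresponds to uniform DIF and $\beta_3 \neq 0$ to nonuniform DIF. Setting $\tau^2 = 0$ recovers the familiar single-level logistic regression DIF model, which treats all responses to the item as independent.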

If this is right

  • The model shows improved Type I error control for DIF detection in CAT data.
  • Performance advantages are more pronounced with shorter tests and smaller exposure rates.
  • Model convergence varies systematically across conditions, linking inferential accuracy to convergence reliability.
  • Multilevel modeling is promising for handling dependencies in adaptive testing DIF analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multilevel approaches could be applied to other types of adaptive assessments beyond CAT.
  • Future work might explore integrating this with different ability estimators or exposure controls.
  • Real-world CAT systems could benefit from routine use of such models to validate item fairness.

Load-bearing premise

The premise is that the dependencies induced by adaptive item selection through provisional ability estimates can be adequately modeled as nuisance effects in a two-level logistic framework without distorting the primary DIF test.

What would settle it

A simulation or real CAT dataset where the two-level model shows higher Type I error rates or lower power than single-level models under matched conditions would falsify the claim of improved performance.
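
A matched comparison of this kind could be tabulated roughly as follows. This is a minimal sketch, not the paper's code: the `Replication` record layout and the toy values are hypothetical, and rates are computed only on replications where both models converged, mirroring the "joint model convergence" conditioning described in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Replication:
    """One simulated item-level DIF test (hypothetical record layout)."""
    has_dif: bool       # True if DIF was simulated for this item
    converged_1l: bool  # single-level model converged
    converged_2l: bool  # two-level model converged
    flagged_1l: bool    # single-level model flagged DIF
    flagged_2l: bool    # two-level model flagged DIF


def rejection_rates(reps, has_dif):
    """Flagging rates for both models on replications where BOTH converged.

    With has_dif=False the rates are empirical Type I error;
    with has_dif=True they are empirical power.
    """
    joint = [
        r for r in reps
        if r.has_dif == has_dif and r.converged_1l and r.converged_2l
    ]
    if not joint:
        return None
    n = len(joint)
    return {
        "n": n,
        "single_level": sum(r.flagged_1l for r in joint) / n,
        "two_level": sum(r.flagged_2l for r in joint) / n,
    }


# Toy records for illustration only (not results from the paper).
reps = [
    Replication(False, True, True, True, False),
    Replication(False, True, True, False, False),
    Replication(False, True, False, True, True),  # dropped: 2L did not converge
    Replication(True, True, True, True, True),
    Replication(True, True, True, False, True),
]
type1 = rejection_rates(reps, has_dif=False)
power = rejection_rates(reps, has_dif=True)
```

The claim would be in trouble if, on real simulation output, the `two_level` Type I rate exceeded the `single_level` rate or its power fell clearly below it under the same conditions.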

Original abstract

Differential item functioning (DIF) detection is an important yet understudied problem in computerized adaptive testing (CAT). In this article, we proposed a two-level logistic model to improve DIF detection in CAT by explicitly accounting for nuisance effects arising from CAT-induced structural dependency. First, we conceptualized that adaptive item selection induces systematic dependencies among examinees and items through provisional ability estimates, whereas traditional single-level DIF methods assume independent observations and may yield misleading results in CAT settings. Then, using a numeric example and Monte Carlo simulations, we compared our proposed two-level model with competing single-level models under various CAT conditions, manipulating test length, exposure control, ability estimator, DIF type, and DIF prevalence. Item-level Type-I error and statistical power conditional on joint model convergence were reported for each model. We showed that the proposed two-level model has improved control of spurious DIF and competitive power relative to single-level models, particularly with shorter tests and smaller exposure rates. However, we observed that the model convergence varied systematically across simulated conditions, highlighting that inferential accuracy and convergence reliability are intertwined in complex CAT DIF settings. Through this study, we underscored both the promise of multilevel DIF modeling in CAT and the need for future research to jointly evaluate convergence and inferential performance when assessing DIF models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-level logistic model for DIF detection in CAT to explicitly model nuisance dependencies induced by adaptive item selection via provisional ability estimates. Through a numeric example and Monte Carlo simulations varying test length, exposure control, ability estimator, DIF type, and prevalence, it compares the two-level model against single-level alternatives. Results (conditional on joint-model convergence) indicate improved Type-I error control for spurious DIF and competitive power, especially under shorter tests and lower exposure rates, while noting systematic variation in convergence rates across conditions.

Significance. If the conditional simulation results can be shown to hold without selection bias from differential convergence, the multilevel framework would address an important gap by providing a principled way to handle CAT-induced dependencies in DIF analysis, potentially leading to more reliable item fairness assessments in operational adaptive testing.

major comments (2)
  1. [Abstract / Simulation results] Abstract and simulation results section: Type-I error and power are reported only conditional on joint model convergence, yet the text states that convergence 'varied systematically across simulated conditions' (test length, exposure rate, model type). This selection effect is load-bearing for the central claim of superior spurious-DIF control, because if the two-level model converges less frequently precisely in the short-test/low-exposure regimes where the largest gains are reported, the headline improvements are computed on a non-representative subset of replications and do not establish superior operating characteristics for the procedure as a whole.
  2. [Simulation study] Simulation design: No information is given on the number of replications per cell, the exact convergence criterion, or whether failed replications were re-run with different starting values. Without these details it is impossible to judge whether the reported conditional metrics are stable or whether differential non-convergence rates distort the model comparisons.
minor comments (2)
  1. [Abstract] The abstract states that 'item-level Type-I error and statistical power conditional on joint model convergence were reported,' but does not indicate whether standard errors or variability measures across replications are provided in the tables or figures.
  2. [Methods] Notation for the two-level model (random effects for examinee-item dependencies) should be introduced with an explicit equation early in the methods section to allow readers to verify how the CAT-induced structure is parameterized.
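
The concern about replication counts and missing variability measures can be made concrete with the usual binomial Monte Carlo standard error of an empirical rejection rate. The sketch below is illustrative only: the function names are hypothetical and the replication counts are made up, since the paper's counts are not reported in the material reviewed here.

```python
import math


def mc_standard_error(rate, n_reps):
    """Binomial Monte Carlo standard error of an empirical rejection rate."""
    if n_reps <= 0:
        raise ValueError("n_reps must be positive")
    return math.sqrt(rate * (1.0 - rate) / n_reps)


def reps_needed(rate, target_se):
    """Approximate number of usable replications for a standard error near target_se."""
    return math.ceil(rate * (1.0 - rate) / target_se ** 2)


# Illustrative numbers only (assumed, not taken from the paper).
nominal_alpha = 0.05
se_500 = mc_standard_error(nominal_alpha, 500)  # all 500 replications usable
se_200 = mc_standard_error(nominal_alpha, 200)  # only 200 converged
n_for_half_point = reps_needed(nominal_alpha, 0.005)
```

Because only converged replications enter a conditional rate, `n_reps` here is the number of converged replications in a cell, not the number run. A cell with poor convergence therefore has a noisier Type I error estimate in addition to any selection effect.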

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments highlighting important issues with conditional reporting and simulation transparency. We agree these points warrant revision to strengthen the manuscript and address potential selection effects.

  1. Referee: [Abstract / Simulation results] Abstract and simulation results section: Type-I error and power are reported only conditional on joint model convergence, yet the text states that convergence 'varied systematically across simulated conditions' (test length, exposure rate, model type). This selection effect is load-bearing for the central claim of superior spurious-DIF control, because if the two-level model converges less frequently precisely in the short-test/low-exposure regimes where the largest gains are reported, the headline improvements are computed on a non-representative subset of replications and do not establish superior operating characteristics for the procedure as a whole.

    Authors: We acknowledge that conditional reporting on convergence introduces a potential selection bias, and that this is particularly relevant given the systematic variation noted in the manuscript. The paper already emphasizes that inferential accuracy and convergence reliability are intertwined in CAT DIF settings. In revision we will add a dedicated discussion of this limitation, include a table or figure of convergence rates by condition (test length, exposure rate, model), and clarify the scope of the reported advantages. We will also explore adding sensitivity analyses using the existing replications where possible. revision: yes

  2. Referee: [Simulation study] Simulation design: No information is given on the number of replications per cell, the exact convergence criterion, or whether failed replications were re-run with different starting values. Without these details it is impossible to judge whether the reported conditional metrics are stable or whether differential non-convergence rates distort the model comparisons.

    Authors: We will revise the simulation design section to report the number of replications per cell, the exact convergence criterion (including software defaults, iteration limits, and tolerance thresholds), and details on handling of non-converged cases, including any re-runs with alternative starting values. These additions will improve reproducibility and allow better assessment of result stability. revision: yes
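
The convergence table and sensitivity analysis promised in these responses might look something like the sketch below. It is an assumption about one reasonable form, not the authors' method: the record layout and condition labels are hypothetical, and the bounds simply ask what the flagging rate would be if every non-converged replication had flagged DIF (upper) or none had (lower).

```python
from collections import defaultdict


def summarize(records):
    """Per-condition convergence rate, conditional flag rate, and worst-case bounds.

    Each record is (condition, converged, flagged); `flagged` is ignored
    when `converged` is False.
    """
    cells = defaultdict(lambda: {"n": 0, "conv": 0, "flag": 0})
    for condition, converged, flagged in records:
        cell = cells[condition]
        cell["n"] += 1
        if converged:
            cell["conv"] += 1
            cell["flag"] += int(flagged)

    out = {}
    for condition, c in cells.items():
        failed = c["n"] - c["conv"]
        out[condition] = {
            "convergence_rate": c["conv"] / c["n"],
            # Rate among converged replications (what the paper reports).
            "conditional_rate": c["flag"] / c["conv"] if c["conv"] else None,
            # Bounds over all replications, whatever the failed ones would have done.
            "lower_bound": c["flag"] / c["n"],
            "upper_bound": (c["flag"] + failed) / c["n"],
        }
    return out


# Toy records for illustration only (not results from the paper).
records = [
    ("short_test", True, False),
    ("short_test", True, True),
    ("short_test", False, False),
    ("short_test", False, False),
    ("long_test", True, False),
    ("long_test", True, False),
    ("long_test", True, True),
    ("long_test", True, False),
]
summary = summarize(records)
```

When the bounds for a cell are wide, the conditional rate alone says little about how the procedure behaves across all replications, which is the referee's first major point.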

Circularity Check

0 steps flagged

No circularity: model proposal and simulation results are independent of inputs


The paper proposes a two-level logistic model to account for CAT-induced dependencies and evaluates it via Monte Carlo simulations comparing Type-I error and power against single-level alternatives. No equations, fitted parameters, or predictions reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central performance claims rest on new simulation output rather than renaming or re-deriving prior results. Convergence conditioning is a reporting limitation but does not create definitional circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities are detailed beyond standard logistic model components. The approach relies on the unelaborated assumption that the two-level structure captures nuisance effects.

pith-pipeline@v0.9.0 · 5766 in / 971 out tokens · 27612 ms · 2026-05-23T20:43:53.327372+00:00 · methodology

