Propensity Score Methods for Local Test Score Equating: Stratification and Inverse Probability Weighting
Pith reviewed 2026-05-23 19:06 UTC · model grok-4.3
The pith
Propensity score stratification and inverse probability weighting enable local test equating using only covariates when no anchor test exists.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that propensity scores estimated from observed covariates can substitute for anchor test scores in local equating. By stratifying on these scores or applying inverse probability weights, the methods produce group-adjusted equating functions that condition on individual-level information and thereby meet Lord's equity requirement without needing an anchor test.
What carries the argument
The argument rests on propensity scores, defined as the estimated probability of group membership given the covariates. They are used either to form strata of comparable examinees or to weight each observation by the inverse of its estimated probability of group membership.
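As a rough illustration of that machinery, the sketch below estimates propensity scores with a hand-rolled logistic regression, then derives rank-based strata and inverse probability weights. Everything here is an assumption made for illustration: the function names, the single simulated covariate, the choice of five strata, and the gradient-ascent fit are not taken from the paper, which may estimate and use the scores differently.

```python
import math
import random

def fit_logistic(X, g, lr=0.1, n_iter=2000):
    """Fit P(G=1 | x) by plain gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, gi in zip(X, g):
            eta = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            resid = gi - 1.0 / (1.0 + math.exp(-eta))
            grad[0] += resid
            for j, x in enumerate(xi):
                grad[j + 1] += resid * x
        beta = [b + lr * gr / n for b, gr in zip(beta, grad)]
    return beta

def propensity(beta, xi):
    """Estimated probability of belonging to group 1 given covariates xi."""
    eta = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
    return 1.0 / (1.0 + math.exp(-eta))

def assign_strata(scores, n_strata=5):
    """Rank-based strata of (nearly) equal size, labelled 0..n_strata-1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata

def ipw_weights(scores, g):
    """Weight each examinee by 1 / P(own group | covariates)."""
    return [1.0 / e if gi == 1 else 1.0 / (1.0 - e) for e, gi in zip(scores, g)]

# Simulated data (hypothetical): one covariate that shifts group membership.
random.seed(0)
n = 400
Z = [[random.gauss(0, 1)] for _ in range(n)]
G = [1 if random.random() < 1 / (1 + math.exp(-0.8 * z[0])) else 0 for z in Z]

beta = fit_logistic(Z, G)
ps = [propensity(beta, z) for z in Z]
strata = assign_strata(ps)
weights = ipw_weights(ps, G)
```

In practice a library routine for logistic regression would replace `fit_logistic`; the hand-rolled version only keeps the sketch free of dependencies.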
If this is right
- Local equating becomes feasible in testing programs that collect only background covariates rather than anchor items.
- Stratification creates discrete comparable subgroups while inverse probability weighting retains all data through rebalancing.
- Method performance improves as the covariates more strongly predict ability differences.
- The two approaches can be compared directly on the same data to choose the better performer for a given correlation strength.
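The contrast between the two approaches can be made concrete on a single toy data set. The sketch below is a minimal illustration, assuming a linear equating function (matching means and standard deviations); the reviewed text does not say which equating function the paper uses, so linear equating is swapped in here for brevity. The function names, the single cut point at 0.5, and the scores are made up.

```python
import math

def weighted_mean_sd(values, weights):
    """Weighted mean and (population) standard deviation."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

def linear_equate(x, x_scores, y_scores, x_w=None, y_w=None):
    """Map form-X score x to the form-Y scale by matching mean and SD."""
    x_w = x_w if x_w is not None else [1.0] * len(x_scores)
    y_w = y_w if y_w is not None else [1.0] * len(y_scores)
    mu_x, sd_x = weighted_mean_sd(x_scores, x_w)
    mu_y, sd_y = weighted_mean_sd(y_scores, y_w)
    return mu_y + (sd_y / sd_x) * (x - mu_x)

def ipw_equate(x, x_scores, x_ps, y_scores, y_ps):
    """IPW version: ps is the estimated P(took form X | covariates)."""
    x_w = [1.0 / e for e in x_ps]            # form-X takers
    y_w = [1.0 / (1.0 - e) for e in y_ps]    # form-Y takers
    return linear_equate(x, x_scores, y_scores, x_w, y_w)

def stratified_equate(x, ps_value, x_scores, x_ps, y_scores, y_ps, cuts):
    """Stratified version: equate using only the stratum containing ps_value."""
    def stratum(e):
        return sum(e >= c for c in cuts)
    s = stratum(ps_value)
    xs = [v for v, e in zip(x_scores, x_ps) if stratum(e) == s]
    ys = [v for v, e in zip(y_scores, y_ps) if stratum(e) == s]
    return linear_equate(x, xs, ys)

# Toy data (made up for illustration only).
x_scores = [10, 12, 13, 15, 16, 18, 20, 11]
x_ps = [0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.35]
y_scores = [9, 11, 12, 14, 16, 17, 19, 13]
y_ps = [0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.65, 0.75]

eq_ipw = ipw_equate(15, x_scores, x_ps, y_scores, y_ps)
eq_strat = stratified_equate(15, 0.55, x_scores, x_ps, y_scores, y_ps, cuts=[0.5])
print(eq_ipw, eq_strat)
```

The stratified version discards examinees outside the relevant stratum, while the weighted version keeps every examinee and rebalances them. The weights used here target the combined population of both groups, which is one common choice among several.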
Where Pith is reading between the lines
- The same propensity-score logic might extend to other psychometric adjustments such as differential item functioning detection when anchor data are absent.
- Programs could test sensitivity by deliberately omitting strong covariates and checking whether equating accuracy drops as predicted.
- If covariates are collected routinely, these methods could reduce reliance on anchor tests and thereby shorten test lengths.
Load-bearing premise
Propensity scores calculated from the observed covariates serve as adequate proxies for the unobserved latent ability differences between the groups.
What would settle it
A dataset in which the true correlation between the covariates and ability is measured or simulated to be near zero, yet the equated scores still show different conditional distributions for examinees of equal ability across groups.
Original abstract
In test equating, ensuring score comparability across different test forms is crucial but particularly challenging when test groups are non-equivalent and no anchor test is available. Local test equating aims to satisfy Lord's equity requirement by conditioning equating transformations on individual-level information, typically using anchor test scores as proxies for latent ability. However, anchor tests are not always available in practice. This paper introduces two novel propensity score-based methods for local equating: stratification and inverse probability weighting (IPW). These methods use covariates to account for group differences, with propensity scores serving as proxies for latent ability differences between test groups. The stratification method partitions examinees into comparable groups based on similar propensity scores, while IPW assigns weights inversely proportional to the probability of group membership. We evaluate these methods through empirical analysis and simulation studies. Results indicate both methods can effectively adjust for group differences, with their relative performance depending on the strength of covariate-ability correlations. The study extends local equating methodology to cases where only covariate information is available, providing testing programs with new tools for ensuring fair score comparability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two propensity score-based methods (stratification and inverse probability weighting) for local test equating in non-equivalent groups without anchor tests. Covariates are used to estimate propensity scores as proxies for latent ability differences; the methods are evaluated in simulation studies and an empirical analysis, with the claim that both can adjust for group differences (with relative performance depending on covariate-ability correlations) and thereby extend local equating to covariate-only settings.
Significance. If the central claim holds under the required assumptions, the work provides a practical extension of local equating methodology to common testing scenarios lacking anchors, leveraging established causal-inference tools in a psychometric context. The simulation framework offers controlled evidence of performance when the proxy assumption is satisfied.
Major comments (2)
- [Simulation study section] The data-generating processes are constructed under the strong ignorability assumption (group membership ⊥ latent ability | covariates), yet no sensitivity analyses or results under unmeasured confounding are reported. This is load-bearing because the claim that the methods satisfy Lord's equity without anchors rests on the propensity scores serving as adequate proxies; violation would leave residual bias in the conditional score distributions.
- [Empirical analysis section] The abstract and method description state that results support effectiveness, but the manuscript provides no details on sample sizes, equating accuracy metrics, or how group-ability differences were quantified or controlled. Without these, it is not possible to verify whether the empirical results actually back the effectiveness claim.
Minor comments (2)
- [Abstract] The abstract lacks any mention of simulation design parameters, sample sizes, or performance metrics, which reduces clarity for readers.
- [Methods] Notation for propensity score estimation and weighting formulas could be clarified with an explicit equation linking the estimated propensity score to the equating transformation.
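One form the explicit link requested in the last comment could take is sketched below. The notation is an assumption chosen for illustration, not the manuscript's: G is the group indicator (G = 1 for examinees who took form X, G = 0 for form Y), Z the covariates, S the propensity stratum, and an equipercentile-style transformation is assumed.

```latex
% Propensity score and inverse probability weights
e(z) = \Pr(G = 1 \mid Z = z), \qquad
w_i = \frac{G_i}{\hat e(z_i)} + \frac{1 - G_i}{1 - \hat e(z_i)}

% Weighted score distributions on each form
\hat F_X(x) = \frac{\sum_{i:\,G_i = 1} w_i \,\mathbf{1}\{X_i \le x\}}{\sum_{i:\,G_i = 1} w_i}, \qquad
\hat F_Y(y) = \frac{\sum_{i:\,G_i = 0} w_i \,\mathbf{1}\{Y_i \le y\}}{\sum_{i:\,G_i = 0} w_i}

% IPW equating, and local equating within propensity stratum s
\hat\varphi_{\mathrm{IPW}}(x) = \hat F_Y^{-1}\bigl(\hat F_X(x)\bigr), \qquad
\hat\varphi_{s}(x) = \hat F_{Y \mid S = s}^{-1}\bigl(\hat F_{X \mid S = s}(x)\bigr)
```

Written this way, the estimated propensity score enters the weighted transformation only through the weights, and the stratified transformation only through the stratum label.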
Simulated Author's Rebuttal
We thank the referee for these detailed comments on the simulation and empirical sections. We address each point below and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Simulation study section] The data-generating processes are constructed under the strong ignorability assumption (group membership ⊥ latent ability | covariates), yet no sensitivity analyses or results under unmeasured confounding are reported. This is load-bearing because the claim that the methods satisfy Lord's equity without anchors rests on the propensity scores serving as adequate proxies; violation would leave residual bias in the conditional score distributions.
  Authors: The simulation design intentionally generates data under strong ignorability to evaluate the methods precisely when the key assumption holds and to isolate the effect of varying covariate-ability correlations. This mirrors standard practice in causal inference papers that first establish performance under the identifying assumption before exploring violations. We acknowledge that sensitivity analyses for unmeasured confounding would be a valuable addition to illustrate robustness limits. We will revise the simulation section to include a brief discussion of this assumption and, if feasible within the existing framework, a limited sensitivity check (e.g., adding an unmeasured confounder in one scenario).
  Revision: partial
- Referee: [Empirical analysis section] The abstract and method description state that results support effectiveness, but the manuscript provides no details on sample sizes, equating accuracy metrics, or how group-ability differences were quantified or controlled. Without these, it is not possible to verify whether the empirical results actually back the effectiveness claim.
  Authors: We agree that the empirical section requires greater transparency. The full manuscript contains the relevant sample sizes, metrics (e.g., equating error measures), and descriptions of how group differences were assessed via covariates, but these details are not presented with sufficient clarity or explicit quantification. We will revise the empirical analysis section to explicitly report sample sizes, define the accuracy metrics used, and detail the quantification and control of group-ability differences, ensuring readers can directly verify the effectiveness claims.
  Revision: yes
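The limited sensitivity check mentioned in the first response (adding an unmeasured confounder in one scenario) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' simulation design: the coefficients, the sample size, and the names `simulate` and `within_stratum_ability_gap` are hypothetical. With a single observed covariate and a positive coefficient, the covariate-only propensity score is monotone in that covariate, so the sketch stratifies on the covariate directly as a stand-in for stratifying on the estimated score.

```python
import math
import random

def simulate(n, b_unobserved, seed):
    """Return rows of (z, ability, group); U drives group membership when b != 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)   # observed covariate
        u = rng.gauss(0, 1)   # unmeasured confounder (hypothetical)
        theta = 0.6 * z + 0.7 * u + 0.4 * rng.gauss(0, 1)  # latent ability
        p = 1.0 / (1.0 + math.exp(-(0.8 * z + b_unobserved * u)))
        g = 1 if rng.random() < p else 0
        rows.append((z, theta, g))
    return rows

def within_stratum_ability_gap(rows, n_strata=5):
    """Average over strata of mean ability in group 1 minus group 0."""
    rows = sorted(rows, key=lambda r: r[0])
    size = len(rows) // n_strata
    gaps = []
    for s in range(n_strata):
        block = rows[s * size:(s + 1) * size]
        g1 = [theta for _, theta, g in block if g == 1]
        g0 = [theta for _, theta, g in block if g == 0]
        gaps.append(sum(g1) / len(g1) - sum(g0) / len(g0))
    return sum(gaps) / n_strata

# Scenario 1: ignorability holds given Z. Scenario 2: U also drives group membership.
gap_ok = within_stratum_ability_gap(simulate(4000, 0.0, seed=1))
gap_violated = within_stratum_ability_gap(simulate(4000, 1.0, seed=1))
print(f"ignorability holds:    {gap_ok:.3f}")
print(f"unmeasured confounder: {gap_violated:.3f}")
```

A within-stratum ability gap that stays clearly away from zero in the second scenario is the residual bias the referee describes: the strata balance the observed covariate but not the ability differences carried by the unobserved one.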
Circularity Check
No circularity: methods and evaluations are independent of fitted inputs
The paper introduces stratification and IPW methods that apply standard propensity score techniques to equate scores using covariates as proxies for ability. These are evaluated on separate simulation studies and empirical data under the stated ignorability assumption, without any derivation step that reduces a claimed result to a quantity fitted from the same data or to a self-citation chain. No equations or claims in the provided text exhibit self-definition, fitted-input-as-prediction, or load-bearing self-citation; the central extension of local equating rests on external statistical methods rather than internal redefinition.