Propensity Score Methods for Local Test Score Equating: Stratification and Inverse Probability Weighting

Gabriel Wallin; Marie Wiberg

arxiv: 2410.22989 · v2 · submitted 2024-10-30 · 📊 stat.ME · stat.AP

Propensity Score Methods for Local Test Score Equating: Stratification and Inverse Probability Weighting

Gabriel Wallin , Marie Wiberg This is my paper

Pith reviewed 2026-05-23 19:06 UTC · model grok-4.3

classification 📊 stat.ME stat.AP

keywords test equatingpropensity scoreslocal equatingstratificationinverse probability weightingnon-equivalent groupsLord's equity requirementcovariate adjustment

0 comments

The pith

Propensity score stratification and inverse probability weighting enable local test equating using only covariates when no anchor test exists.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two methods that use propensity scores computed from covariates to adjust for differences between non-equivalent test groups. Stratification divides examinees into strata with similar scores, while inverse probability weighting reweights observations to balance the groups. These techniques serve as proxies for latent ability and produce equating transformations that aim to satisfy Lord's equity requirement at the individual level. A reader would care because many operational testing programs lack anchor tests yet still need comparable scores across forms. Simulation and empirical results indicate both methods reduce group differences, with their relative success depending on how well the covariates correlate with ability.

Core claim

The central claim is that propensity scores estimated from observed covariates can substitute for anchor test scores in local equating. By stratifying on these scores or applying inverse probability weights, the methods produce group-adjusted equating functions that condition on individual-level information and thereby meet the equity requirement without requiring an anchor test.

What carries the argument

Propensity scores, defined as the estimated probability of group membership given the covariates, which are then used either to form strata of comparable examinees or to weight observations inversely to their group probability.

If this is right

Local equating becomes feasible in testing programs that collect only background covariates rather than anchor items.
Stratification creates discrete comparable subgroups while inverse probability weighting retains all data through rebalancing.
Method performance improves as the covariates more strongly predict ability differences.
The two approaches can be compared directly on the same data to choose the better performer for a given correlation strength.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same propensity-score logic might extend to other psychometric adjustments such as differential item functioning detection when anchor data are absent.
Programs could test sensitivity by deliberately omitting strong covariates and checking whether equating accuracy drops as predicted.
If covariates are collected routinely, these methods could reduce reliance on anchor tests and thereby shorten test lengths.

Load-bearing premise

Propensity scores calculated from the observed covariates serve as adequate proxies for the unobserved latent ability differences between the groups.

What would settle it

A dataset in which the true correlation between the covariates and ability is measured or simulated to be near zero, yet the equated scores still show different conditional distributions for examinees of equal ability across groups.

Figures

Figures reproduced from arXiv: 2410.22989 by Gabriel Wallin, Marie Wiberg.

**Figure 2.** Figure 2: The estimated equating functions, conditioning on different values of the anchor score. The 10th% [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: The estimated propensity scores. In the 30th percentile group, the distribution of weights remains tightly centered around 1, with limited spread. This suggests that for this group, the propensity scores closely match the treatment assignment probabilities, resulting in minimal reweighting. For the 50th, 70th, and 90th percentile groups, they show a broader range of values, with some weights considerably h… view at source ↗

**Figure 4.** Figure 4: The estimated equating functions, conditioning on different values of the estimated and stratified [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: The distribution of the weights used in the IPW-based equating method, across five groups defined [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: The estimated IPW-based equating functions, conditioning on different values of the estimated and [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

read the original abstract

In test equating, ensuring score comparability across different test forms is crucial but particularly challenging when test groups are non-equivalent and no anchor test is available. Local test equating aims to satisfy Lord's equity requirement by conditioning equating transformations on individual-level information, typically using anchor test scores as proxies for latent ability. However, anchor tests are not always available in practice. This paper introduces two novel propensity score-based methods for local equating: stratification and inverse probability weighting (IPW). These methods use covariates to account for group differences, with propensity scores serving as proxies for latent ability differences between test groups. The stratification method partitions examinees into comparable groups based on similar propensity scores, while IPW assigns weights inversely proportional to the probability of group membership. We evaluate these methods through empirical analysis and simulation studies. Results indicate both methods can effectively adjust for group differences, with their relative performance depending on the strength of covariate-ability correlations. The study extends local equating methodology to cases where only covariate information is available, providing testing programs with new tools for ensuring fair score comparability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Applies propensity scores to local equating without anchors but rests on untested ignorability.

read the letter

This paper takes propensity score stratification and inverse probability weighting and applies them to local equating when only covariates are available instead of an anchor test. That is the actual new element. It sets up the two approaches in straightforward terms and reports that both can reduce group differences, with relative performance tied to how well the covariates correlate with ability. The simulations and empirical checks are presented as supportive evidence for the extension. That part is useful for anyone who needs equating tools outside the usual anchor-test setup. The soft spots are the missing study details and the central assumption. The abstract supplies no sample sizes, metrics, or design information, so the strength of the results cannot be judged from what is shown. More importantly, the methods require that the observed covariates capture all relevant differences in latent ability between groups. The paper gives no sensitivity analysis or checks for unmeasured confounding, so the stress-test concern stands: if that assumption fails, the adjusted functions will retain bias. This work is aimed at researchers in educational measurement who handle non-equivalent groups without anchors. A reader focused on equating methods would get value from the full simulation results and any robustness checks that exist in the manuscript. It has enough structure and a clear methodological step to deserve peer review so the evidence and assumption handling can be examined directly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes two propensity score-based methods (stratification and inverse probability weighting) for local test equating in non-equivalent groups without anchor tests. Covariates are used to estimate propensity scores as proxies for latent ability differences; the methods are evaluated in simulation studies and an empirical analysis, with the claim that both can adjust for group differences (with relative performance depending on covariate-ability correlations) and thereby extend local equating to covariate-only settings.

Significance. If the central claim holds under the required assumptions, the work provides a practical extension of local equating methodology to common testing scenarios lacking anchors, leveraging established causal-inference tools in a psychometric context. The simulation framework offers controlled evidence of performance when the proxy assumption is satisfied.

major comments (2)

[Simulation study section] Simulation study section: the data-generating processes are constructed under the strong ignorability assumption (group membership ⊥ latent ability | covariates), yet no sensitivity analyses or results under unmeasured confounding are reported. This is load-bearing because the claim that the methods satisfy Lord's equity without anchors rests on the propensity scores serving as adequate proxies; violation would leave residual bias in the conditional score distributions.
[Empirical analysis section] Empirical analysis section: the abstract and method description state that results support effectiveness, but the manuscript provides no details on sample sizes, equating accuracy metrics, or how group-ability differences were quantified or controlled. Without these, it is not possible to verify whether the empirical results actually back the effectiveness claim.

minor comments (2)

[Abstract] Abstract: lacks any mention of simulation design parameters, sample sizes, or performance metrics, which reduces clarity for readers.
[Methods] Notation for propensity score estimation and weighting formulas could be clarified with an explicit equation linking the estimated propensity score to the equating transformation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these detailed comments on the simulation and empirical sections. We address each point below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Simulation study section] Simulation study section: the data-generating processes are constructed under the strong ignorability assumption (group membership ⊥ latent ability | covariates), yet no sensitivity analyses or results under unmeasured confounding are reported. This is load-bearing because the claim that the methods satisfy Lord's equity without anchors rests on the propensity scores serving as adequate proxies; violation would leave residual bias in the conditional score distributions.

Authors: The simulation design intentionally generates data under strong ignorability to evaluate the methods precisely when the key assumption holds and to isolate the effect of varying covariate-ability correlations. This mirrors standard practice in causal inference papers that first establish performance under the identifying assumption before exploring violations. We acknowledge that sensitivity analyses for unmeasured confounding would be a valuable addition to illustrate robustness limits. We will revise the simulation section to include a brief discussion of this assumption and, if feasible within the existing framework, a limited sensitivity check (e.g., adding an unmeasured confounder in one scenario). revision: partial
Referee: [Empirical analysis section] Empirical analysis section: the abstract and method description state that results support effectiveness, but the manuscript provides no details on sample sizes, equating accuracy metrics, or how group-ability differences were quantified or controlled. Without these, it is not possible to verify whether the empirical results actually back the effectiveness claim.

Authors: We agree that the empirical section requires greater transparency. The full manuscript contains the relevant sample sizes, metrics (e.g., equating error measures), and descriptions of how group differences were assessed via covariates, but these details are not presented with sufficient clarity or explicit quantification. We will revise the empirical analysis section to explicitly report sample sizes, define the accuracy metrics used, and detail the quantification and control of group-ability differences, ensuring readers can directly verify the effectiveness claims. revision: yes

Circularity Check

0 steps flagged

No circularity: methods and evaluations are independent of fitted inputs

full rationale

The paper introduces stratification and IPW methods that apply standard propensity score techniques to equate scores using covariates as proxies for ability. These are evaluated on separate simulation studies and empirical data under the stated ignorability assumption, without any derivation step that reduces a claimed result to a quantity fitted from the same data or to a self-citation chain. No equations or claims in the provided text exhibit self-definition, fitted-input-as-prediction, or load-bearing self-citation; the central extension of local equating rests on external statistical methods rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no specific free parameters, axioms, or invented entities can be extracted or audited from the provided text.

pith-pipeline@v0.9.0 · 5718 in / 1037 out tokens · 40730 ms · 2026-05-23T19:06:07.345938+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

Austin, P. C. (2008). The performance of different propensity-score methods for estimating relative risks.Journal of clinical epidemiology, 61(6):537–545. Bränberg, K. and Wiberg, M. (2011). Observed score linear equating with covariates. Journal of Educational Measurement, 48(4):419–440

work page 2008
[2]

L., Eignor, D

Cook, L. L., Eignor, D. R., and Schmitt, A. P. (1990). Equating achievement tests using samples matched on ability. ETS Research Report Series, 1990(1):i–58

work page 1990
[3]

J., Liu, J., and Hammond, S

Dorans, N. J., Liu, J., and Hammond, S. (2008). Anchor test type and population invariance: An exploration across subpopulations and test administrations.Applied Psychological Measurement, 32(1):81–97. González, J. and Wiberg, M. (2017). Applying test equating methods using r.Cham: Springer. Hernán, M. A. and Robins, J. M. (2006). Estimating causal effect...

work page 2008
[4]

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685

work page 1952
[5]

W., and Lee, M.-Y

Hsu, T.-C., Wu, K.-l., Yu, J.-Y. W., and Lee, M.-Y. (2002). Exploring the feasibility of collateral information test equating. International Journal of Testing, 2(1):1–14

work page 2002
[6]

Huber, M. (2015). Causal pitfalls in the decomposition of wage gaps.Journal of Business & Economic Statistics, 33(2):179–191

work page 2015
[7]

Kolen, M. J. (1990). Does matching in equating work: A discussion.Applied Measurement in Education, 3(1):97–104

work page 1990
[8]

Kolen, M. J. and Brennan, R. L. (2014).Test equating, Scaling and Linking: Methods and practices. New York: Springer

work page 2014
[9]

E., and Li, M.-Y

Liou, M., Cheng, P. E., and Li, M.-Y. (2001). Estimating comparable scores using surrogate variables. Applied Psychological Measurement, 25(2):197–207

work page 2001
[10]

A., Dorans, N

Livingston, S. A., Dorans, N. J., and Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3(1):73–95

work page 1990
[11]

Longford, N. T. (2015). Equating without an anchor for nonequivalent groups of examinees.Journal of Educational and Behavioral Statistics, 40(3):227–253

work page 2015
[12]

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates. Lyrén, P.-E. and Hambleton, R. K. (2011). Consequences of violated equating assumptions under the equivalent groups design. International Journal of Testing, 11(4):308–323

work page 1980
[13]

Moses, T., Deng, W., and Zhang, Y.-L. (2010). The use of two anchors in nonequivalent groups with anchor test (neat) equating. ETS Research Report Series, 2010(2):i–33

work page 2010
[14]

Paek, I., Liu, J., and Oh, H. J. (2006). Investigation of propensity score matching on linear/nonlinear equating method for the p/n/nmsqt. Technical Report SR-2006-55, ETS, Princeton, NJ

work page 2006
[15]

Pais, J. (2011). Socioeconomic background and racial earnings inequality: A propensity score analysis.Social science research, 40(1):37–49. 21

work page 2011
[16]

Powers, S. J. (2010).Impact of matched samples equating methods on equating accuracy and the adequacy of equating assumptions. The University of Iowa. R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical

work page 2010
[17]

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects.Biometrika, 70(1):41–55

work page 1983
[18]

Rosenbaum, P. R. and Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American statistical Association, 79(387):516–524

work page 1984
[19]

Sungworn, N.(2009).An investigation of using collateral information to reduce equating biases of the post-stratification equating method. Ph. D. Thesis, Michigan State University

work page 2009
[20]

Thoemmes, F. J. and Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate behavioral research, 46(1):90–118. van der Linden, W. J. (2011). Local observed-score equating. InStatistical models for test equating, scaling, and linking. New York: Springer. van der Linden, W. J. and Wiberg, M. (2010). Local obse...

work page 2011
[21]

and Wiberg, M

Wallin, G. and Wiberg, M. (2019). Kernel equating using propensity scores for non-equivalent groups.Journal of Educational and Behavioral Statistics, 44(4):390–414

work page 2019
[22]

and Wiberg, M

Wallin, G. and Wiberg, M. (2023). Model misspecification and robustness of observed-score test equating using propensity scores. Journal of Educational and Behavioral Statistics, 48(5):603–635

work page 2023
[23]

and Bränberg, K

Wiberg, M. and Bränberg, K. (2015). Kernel equating under the non-equivalent groups with covariates design.Applied Psychological Measurement, 39(5):349–361

work page 2015
[24]

(2024).Generalized Kernel Equating with applications in R

Wiberg, M., Gonzalez, J., and von Davier, A. (2024).Generalized Kernel Equating with applications in R. Boca

work page 2024
[25]

and van der Linden, W

Wiberg, M. and van der Linden, W. J. (2011). Local linear observed-score equating.Journal of Educational Mea- surement, 48:229–254

work page 2011
[26]

J., and von Davier, A

Wiberg, M., van der Linden, W. J., and von Davier, A. A. (2014). Local observed-score kernel equating.Journal of Educational Measurement, 51(1):57–74

work page 2014
[27]

Wright, N. K. and Dorans, N. J. (1993). Using the selection variable for matching or equating.ETS Research Report Series, 1993(1):i–22. 22

work page 1993

[1] [1]

Austin, P. C. (2008). The performance of different propensity-score methods for estimating relative risks.Journal of clinical epidemiology, 61(6):537–545. Bränberg, K. and Wiberg, M. (2011). Observed score linear equating with covariates. Journal of Educational Measurement, 48(4):419–440

work page 2008

[2] [2]

L., Eignor, D

Cook, L. L., Eignor, D. R., and Schmitt, A. P. (1990). Equating achievement tests using samples matched on ability. ETS Research Report Series, 1990(1):i–58

work page 1990

[3] [3]

J., Liu, J., and Hammond, S

Dorans, N. J., Liu, J., and Hammond, S. (2008). Anchor test type and population invariance: An exploration across subpopulations and test administrations.Applied Psychological Measurement, 32(1):81–97. González, J. and Wiberg, M. (2017). Applying test equating methods using r.Cham: Springer. Hernán, M. A. and Robins, J. M. (2006). Estimating causal effect...

work page 2008

[4] [4]

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685

work page 1952

[5] [5]

W., and Lee, M.-Y

Hsu, T.-C., Wu, K.-l., Yu, J.-Y. W., and Lee, M.-Y. (2002). Exploring the feasibility of collateral information test equating. International Journal of Testing, 2(1):1–14

work page 2002

[6] [6]

Huber, M. (2015). Causal pitfalls in the decomposition of wage gaps.Journal of Business & Economic Statistics, 33(2):179–191

work page 2015

[7] [7]

Kolen, M. J. (1990). Does matching in equating work: A discussion.Applied Measurement in Education, 3(1):97–104

work page 1990

[8] [8]

Kolen, M. J. and Brennan, R. L. (2014).Test equating, Scaling and Linking: Methods and practices. New York: Springer

work page 2014

[9] [9]

E., and Li, M.-Y

Liou, M., Cheng, P. E., and Li, M.-Y. (2001). Estimating comparable scores using surrogate variables. Applied Psychological Measurement, 25(2):197–207

work page 2001

[10] [10]

A., Dorans, N

Livingston, S. A., Dorans, N. J., and Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3(1):73–95

work page 1990

[11] [11]

Longford, N. T. (2015). Equating without an anchor for nonequivalent groups of examinees.Journal of Educational and Behavioral Statistics, 40(3):227–253

work page 2015

[12] [12]

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates. Lyrén, P.-E. and Hambleton, R. K. (2011). Consequences of violated equating assumptions under the equivalent groups design. International Journal of Testing, 11(4):308–323

work page 1980

[13] [13]

Moses, T., Deng, W., and Zhang, Y.-L. (2010). The use of two anchors in nonequivalent groups with anchor test (neat) equating. ETS Research Report Series, 2010(2):i–33

work page 2010

[14] [14]

Paek, I., Liu, J., and Oh, H. J. (2006). Investigation of propensity score matching on linear/nonlinear equating method for the p/n/nmsqt. Technical Report SR-2006-55, ETS, Princeton, NJ

work page 2006

[15] [15]

Pais, J. (2011). Socioeconomic background and racial earnings inequality: A propensity score analysis.Social science research, 40(1):37–49. 21

work page 2011

[16] [16]

Powers, S. J. (2010).Impact of matched samples equating methods on equating accuracy and the adequacy of equating assumptions. The University of Iowa. R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical

work page 2010

[17] [17]

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects.Biometrika, 70(1):41–55

work page 1983

[18] [18]

Rosenbaum, P. R. and Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American statistical Association, 79(387):516–524

work page 1984

[19] [19]

Sungworn, N.(2009).An investigation of using collateral information to reduce equating biases of the post-stratification equating method. Ph. D. Thesis, Michigan State University

work page 2009

[20] [20]

Thoemmes, F. J. and Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate behavioral research, 46(1):90–118. van der Linden, W. J. (2011). Local observed-score equating. InStatistical models for test equating, scaling, and linking. New York: Springer. van der Linden, W. J. and Wiberg, M. (2010). Local obse...

work page 2011

[21] [21]

and Wiberg, M

Wallin, G. and Wiberg, M. (2019). Kernel equating using propensity scores for non-equivalent groups.Journal of Educational and Behavioral Statistics, 44(4):390–414

work page 2019

[22] [22]

and Wiberg, M

Wallin, G. and Wiberg, M. (2023). Model misspecification and robustness of observed-score test equating using propensity scores. Journal of Educational and Behavioral Statistics, 48(5):603–635

work page 2023

[23] [23]

and Bränberg, K

Wiberg, M. and Bränberg, K. (2015). Kernel equating under the non-equivalent groups with covariates design.Applied Psychological Measurement, 39(5):349–361

work page 2015

[24] [24]

(2024).Generalized Kernel Equating with applications in R

Wiberg, M., Gonzalez, J., and von Davier, A. (2024).Generalized Kernel Equating with applications in R. Boca

work page 2024

[25] [25]

and van der Linden, W

Wiberg, M. and van der Linden, W. J. (2011). Local linear observed-score equating.Journal of Educational Mea- surement, 48:229–254

work page 2011

[26] [26]

J., and von Davier, A

Wiberg, M., van der Linden, W. J., and von Davier, A. A. (2014). Local observed-score kernel equating.Journal of Educational Measurement, 51(1):57–74

work page 2014

[27] [27]

Wright, N. K. and Dorans, N. J. (1993). Using the selection variable for matching or equating.ETS Research Report Series, 1993(1):i–22. 22

work page 1993