Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices

Dominik Rothenhäusler; Elisabeth Paulson; Jens Hainmueller; Jeremy Ferwerda; Kirk Bansak; Michael Hotard

arXiv: 2605.06686 · v1 · submitted 2026-04-25 · cs.LG · econ.EM · stat.AP · stat.ML

Pith reviewed 2026-05-11 01:23 UTC · model grok-4.3

keywords: refugee matching, off-policy evaluation, robustness, counterfactual impact, IPW, AIPW, policy evaluation

The pith

Refugee matching impact estimates stay consistent in size and significance across different off-policy evaluation methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper checks whether the apparent gains from matching refugees to U.S. locations based on predicted employment outcomes depend on the particular way researchers estimate what would have happened under other policies. It applies inverse probability weighting along with several variants of augmented inverse probability weighting and adds changes to the statistical models and to how the matching assignments are carried out. Across every combination the estimated improvements remain roughly the same size and reach statistical significance in most cases. The new numbers also line up closely with the earlier 2018 results. Readers should care because this pattern suggests the matching approach produces real benefits rather than results that appear only when one specific evaluation recipe is followed.

Core claim

This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. (2018).

What carries the argument

Off-policy evaluation methods such as inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW), together with changes to modeling architectures and assignment procedures.
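To make the two estimator families concrete, here is a minimal sketch of IPW and AIPW policy-value estimates on simulated data. Everything in it is an assumption for illustration: the function names, the toy data-generating process (a binary covariate, two locations, coin-flip logging), and the cell-mean outcome model are ours, not the paper's implementation, and the outcome model is fit on the same sample without cross-fitting.

```python
import random


def ipw_value(actions, outcomes, target, propensity):
    """IPW estimate of the mean outcome under the target assignment.

    propensity[i] is the logging probability of the location unit i received.
    """
    total = 0.0
    for a, y, t, p in zip(actions, outcomes, target, propensity):
        if a == t:  # only units whose logged location equals the target count
            total += y / p
    return total / len(outcomes)


def aipw_value(actions, outcomes, target, propensity, mu_target):
    """AIPW estimate: outcome-model prediction plus a reweighted residual.

    mu_target[i] is the predicted outcome for unit i at its target location.
    """
    total = 0.0
    for a, y, t, p, m in zip(actions, outcomes, target, propensity, mu_target):
        total += m
        if a == t:  # logged and target locations coincide, so correct with y - m
            total += (y - m) / p
    return total / len(outcomes)


def simulate(n, seed=0):
    """Hypothetical logged data: binary covariate, two locations, uniform logging."""
    rng = random.Random(seed)
    xs, actions, outcomes = [], [], []
    for _ in range(n):
        x = rng.randint(0, 1)
        a = rng.randint(0, 1)              # logging policy: coin flip
        p_employed = 0.3 + 0.2 * (a == x)  # a "matched" location helps
        xs.append(x)
        actions.append(a)
        outcomes.append(1 if rng.random() < p_employed else 0)
    return xs, actions, outcomes


def cell_means(xs, actions, outcomes):
    """Plug-in outcome model: mean outcome in each (covariate, location) cell."""
    sums, counts = {}, {}
    for x, a, y in zip(xs, actions, outcomes):
        sums[(x, a)] = sums.get((x, a), 0) + y
        counts[(x, a)] = counts.get((x, a), 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


xs, actions, outcomes = simulate(20000)
target = list(xs)                # toy rule: send each unit to location x
propensity = [0.5] * len(xs)     # logging probabilities, known by construction
mu = cell_means(xs, actions, outcomes)
mu_target = [mu[(x, t)] for x, t in zip(xs, target)]

v_obs = sum(outcomes) / len(outcomes)
v_ipw = ipw_value(actions, outcomes, target, propensity)
v_aipw = aipw_value(actions, outcomes, target, propensity, mu_target)
print(f"observed {v_obs:.3f}  IPW {v_ipw:.3f}  AIPW {v_aipw:.3f}")
```

In this toy world the mean outcome is 0.4 under coin-flip logging and 0.5 under the target rule by construction, so both estimates should land near 0.5. Real refugee assignment involves capacity constraints and estimated propensities, which this sketch ignores.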

If this is right

Impact estimates remain similar in size no matter which of the tested evaluation methods or modifications is used.
Statistical significance appears in most of the scenarios examined.
The results match the original findings from the 2018 refugee-matching study.
The gains from matching therefore do not appear to be artifacts of one narrow evaluation approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Policymakers could place more weight on matching systems for refugee placement knowing the reported benefits survive changes in how the counterfactual is estimated.
Researchers working on similar location-assignment problems in job training or housing could add the same range of evaluation checks to strengthen their own conclusions.
If the pattern generalizes, data-driven matching might be applied more widely to other groups whose outcomes depend on where they are placed.

Load-bearing premise

That the tested set of off-policy methods, model changes, and assignment rules is wide enough to cover the main ways evaluation results could differ and that each method is applied without hidden biases on the refugee data.
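One way a reader might probe this premise is to inspect the inverse-probability weights themselves, since poor overlap affects every weighting-based variant at once. The sketch below is a hypothetical diagnostic, not something the paper reports: the function name and the returned fields are ours. It computes the Kish effective sample size, the share of total weight carried by the single largest weight, and the smallest logging propensity.

```python
def weight_diagnostics(actions, target, propensity):
    """Summarize the IPW weights implied by a target assignment.

    A unit gets weight 1 / propensity when its logged location equals the
    target location, and weight 0 otherwise.
    """
    weights = [1.0 / p if a == t else 0.0
               for a, t, p in zip(actions, target, propensity)]
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    return {
        "n": len(weights),
        "n_matched": sum(1 for w in weights if w > 0),
        # Kish effective sample size: (sum of weights)^2 / sum of squared weights
        "ess": (total * total) / sum_sq if sum_sq > 0 else 0.0,
        "max_weight_share": max(weights) / total if total > 0 else 0.0,
        "min_propensity": min(propensity),
    }


# Hand-made example: one unit has a small logging propensity and dominates.
report = weight_diagnostics(
    actions=[0, 1, 1, 0],
    target=[0, 1, 0, 0],
    propensity=[0.5, 0.5, 0.5, 0.1],
)
print(report)
```

An effective sample size far below the number of matched units, or a single unit carrying most of the weight, would suggest that agreement across estimators rests on a few observations.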

What would settle it

Finding one additional off-policy method or modeling change that produces impact estimates differing substantially in magnitude or losing statistical significance would challenge the claim of robustness.

Figures

Figures reproduced from arXiv: 2605.06686 by Dominik Rothenhäusler, Elisabeth Paulson, Jens Hainmueller, Jeremy Ferwerda, Kirk Bansak, Michael Hotard.

**Figure 2.** Percent Gains Over the Observed Baseline

Original abstract

Previous research has investigated the potential of refugee matching for boosting refugee outcomes, first considered by Bansak et al. (2018). This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. (2018).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds that refugee matching gains stay consistent across IPW and AIPW variants, but the shared assumptions mean this does not rule out a common bias.

Dear colleague,

The main takeaway is that the positive impact estimates from the refugee matching policy hold steady when the authors swap in different off-policy estimators on the same US data. They report similar magnitudes and mostly significant results that line up with the 2018 Bansak paper.

What the work actually does is apply inverse probability weighting and several augmented IPW versions, plus changes to the modeling architectures and assignment rules, then show the numbers do not move much. That is a straightforward robustness check and it earns credit for being direct about the stability finding.

The soft spot is the one flagged in the stress test: every method tested still rests on the same observed covariates and the unconfoundedness assumption. Refugee assignment has clear selection into the program, so if unmeasured factors affect both placement and outcomes, all the variants will be biased in the same direction and their agreement does not confirm the true counterfactual effect. The abstract also gives almost no implementation details on sample sizes, exact model fits, or how significance was calculated, which leaves the claim plausible but hard to verify in full.

This paper is for readers who work on causal policy evaluation in immigration or similar matching settings. Someone who wants a quick data point on whether one particular set of off-policy choices drives the result would get value from it.

I would send it for peer review. The policy stakes are real and a focused robustness exercise like this can still be worth referee time if the full details check out.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that counterfactual impact estimates for a US refugee matching policy are stable across off-policy evaluation methods. Applying IPW and multiple AIPW variants (with alternative modeling architectures and assignment procedures) to the Bansak et al. (2018) data yields impact estimates that remain consistent in magnitude, are statistically significant in most cases, and align with the original findings.

Significance. If the reported consistency holds under verifiable implementations, the work strengthens confidence in the original Bansak et al. (2018) positive gains by showing they are not artifacts of a single estimator choice. The direct head-to-head comparison of IPW and AIPW variants on the same refugee dataset is a useful sensitivity exercise within the class of methods that rely on observed covariates and unconfoundedness.

major comments (2)

[Methods] Methods section: The manuscript provides no sample sizes, no exact model specifications (e.g., which learners are used for the outcome regression in AIPW variants), no hyperparameter choices, and no description of how statistical significance is assessed. These omissions are load-bearing for the central claim of robustness, as it is impossible to determine whether the reported consistency reflects genuinely distinct estimators or shared implementation decisions.
[Results] Results section: The paper reports that estimates are 'consistent in magnitude' and 'statistically significant in most cases' but does not test or report whether the point estimates across methods are statistically distinguishable from one another or from the Bansak et al. (2018) benchmark. Without such comparisons, the strength of the robustness conclusion cannot be evaluated.
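The second major comment asks for a formal comparison between estimators. A minimal sketch of one way to do this follows, assuming both estimators are computed on the same units so that their per-unit scores can be differenced directly. The function names are hypothetical, and the standard error treats the propensities and the outcome model as fixed, so it understates uncertainty when those are estimated.

```python
import math
import statistics


def ipw_scores(actions, outcomes, target, propensity):
    """Per-unit IPW scores; their mean is the IPW policy-value estimate."""
    return [y / p if a == t else 0.0
            for a, y, t, p in zip(actions, outcomes, target, propensity)]


def aipw_scores(actions, outcomes, target, propensity, mu_target):
    """Per-unit AIPW scores; their mean is the AIPW policy-value estimate."""
    return [m + ((y - m) / p if a == t else 0.0)
            for a, y, t, p, m
            in zip(actions, outcomes, target, propensity, mu_target)]


def paired_difference(scores_a, scores_b):
    """Mean difference between two estimators, its standard error, and z.

    Differencing unit by unit accounts for the correlation between
    estimators that share the same sample.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    z = mean_diff / se if se > 0 else float("nan")
    return mean_diff, se, z


# Hand-made scores, purely to show the call pattern.
diff, se, z = paired_difference([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
print(f"difference {diff:.3f}, standard error {se:.3f}, z {z:.2f}")
```

A small z statistic here would say the two estimators are statistically indistinguishable on this sample, which is a stronger statement than observing that their point estimates look similar.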

minor comments (1)

[Abstract] Abstract: The phrases 'several evaluation methods' and 'multiple variants' are vague; stating the exact number of AIPW variants and the distinct assignment procedures would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help clarify how to strengthen the presentation of our robustness analysis. We address each major comment below and will incorporate revisions to improve transparency and rigor without altering the core findings.

Referee: [Methods] Methods section: The manuscript provides no sample sizes, no exact model specifications (e.g., which learners are used for the outcome regression in AIPW variants), no hyperparameter choices, and no description of how statistical significance is assessed. These omissions are load-bearing for the central claim of robustness, as it is impossible to determine whether the reported consistency reflects genuinely distinct estimators or shared implementation decisions.

Authors: We agree that these implementation details are necessary to substantiate the robustness claim and allow readers to verify that the estimators are meaningfully distinct. The original manuscript focused on high-level results for brevity, but this was an oversight given the paper's emphasis on sensitivity to method choice. In the revised version, we will report the exact sample size drawn from the Bansak et al. (2018) data, specify the learners and architectures used for propensity scores and outcome regressions in each AIPW variant (including any modifications), detail hyperparameter selection procedures, and describe the inference method (e.g., bootstrap or analytic variance estimators) used to assess statistical significance. These additions will directly address the concern.
revision: yes

Referee: [Results] Results section: The paper reports that estimates are 'consistent in magnitude' and 'statistically significant in most cases' but does not test or report whether the point estimates across methods are statistically distinguishable from one another or from the Bansak et al. (2018) benchmark. Without such comparisons, the strength of the robustness conclusion cannot be evaluated.

Authors: The referee is correct that we did not conduct or report formal statistical tests comparing point estimates across methods or against the Bansak et al. (2018) benchmark. Our robustness argument centered on the practical similarity in magnitudes, signs, and significance patterns rather than formal equivalence or difference testing. To strengthen the manuscript, we will add in revision a supplementary analysis (e.g., differences in estimates with standard errors or overlap of confidence intervals) to allow readers to assess distinguishability. At the same time, we note that even statistically distinguishable estimates can still support robustness if they remain close in magnitude and policy-relevant interpretation, which is the primary contribution here.
revision: partial
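The responses mention bootstrap inference and confidence intervals without saying how they would be computed. As a rough illustration only, the sketch below builds a percentile bootstrap interval for the percent gain of an IPW policy value over the observed mean outcome. All names and the simulated data are assumptions; the procedure resamples units with the logging propensities and the target assignment held fixed, so it does not capture uncertainty from refitting models or re-solving the matching.

```python
import random


def percent_gain(actions, outcomes, target, propensity):
    """Percent gain of the IPW policy value over the observed mean outcome."""
    n = len(outcomes)
    v_policy = sum(y / p for a, y, t, p
                   in zip(actions, outcomes, target, propensity) if a == t) / n
    v_observed = sum(outcomes) / n
    return 100.0 * (v_policy - v_observed) / v_observed


def bootstrap_gain_ci(actions, outcomes, target, propensity,
                      n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the percent gain, resampling units."""
    rng = random.Random(seed)
    n = len(outcomes)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(percent_gain(
            [actions[i] for i in idx],
            [outcomes[i] for i in idx],
            [target[i] for i in idx],
            [propensity[i] for i in idx],
        ))
    gains.sort()
    lower = gains[int((alpha / 2) * n_boot)]
    upper = gains[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical logged data: binary covariate, two locations, coin-flip logging.
rng = random.Random(1)
xs, actions, outcomes = [], [], []
for _ in range(4000):
    x, a = rng.randint(0, 1), rng.randint(0, 1)
    xs.append(x)
    actions.append(a)
    outcomes.append(1 if rng.random() < 0.3 + 0.2 * (a == x) else 0)
target = list(xs)              # toy rule: send each unit to location x
propensity = [0.5] * len(xs)   # known by construction in this simulation

lo, hi = bootstrap_gain_ci(actions, outcomes, target, propensity)
print(f"95% bootstrap interval for the percent gain: [{lo:.1f}, {hi:.1f}]")
```

Reporting intervals like this for each estimator, side by side, would let readers judge overlap directly. By construction the true gain in this simulation is 25 percent (0.5 against 0.4).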

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of estimators

The paper applies multiple off-policy estimators (IPW and AIPW variants with different modeling choices and assignment procedures) to the same refugee data and reports that the resulting impact estimates are consistent in magnitude and mostly significant, plus consistent with Bansak et al. (2018). No derivation chain, fitted parameter, or prediction is shown to equal its own inputs by construction. The self-citation to prior work is a cross-check rather than a load-bearing premise that defines the result. The analysis remains an independent empirical robustness exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard econometric assumptions required for valid off-policy evaluation (correct propensity score specification, overlap, and no unmeasured confounding) rather than new free parameters, axioms, or invented entities introduced in this paper.
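For reference, the assumptions named above can be written next to the standard textbook forms of the two estimators. The notation is ours, not necessarily the paper's: π is the target assignment rule, e(a | x) the logging propensity, and μ(x, a) the outcome regression. Writing the target location as a function of a unit's own covariates is a simplification, since matching under capacity constraints makes each assignment depend on the whole cohort.

```latex
% Notation (ours, for illustration): \pi is the target assignment rule,
% e(a \mid x) the logging propensity, \mu(x, a) the outcome regression.
\begin{align*}
V(\pi) &= \mathbb{E}\bigl[ Y_i(\pi(X_i)) \bigr], \\
\hat V_{\mathrm{IPW}}(\pi) &= \frac{1}{n} \sum_{i=1}^{n}
  \frac{\mathbf{1}\{A_i = \pi(X_i)\}}{\hat e(A_i \mid X_i)} \, Y_i, \\
\hat V_{\mathrm{AIPW}}(\pi) &= \frac{1}{n} \sum_{i=1}^{n} \Bigl[
  \hat\mu(X_i, \pi(X_i)) +
  \frac{\mathbf{1}\{A_i = \pi(X_i)\}}{\hat e(A_i \mid X_i)}
  \bigl( Y_i - \hat\mu(X_i, A_i) \bigr) \Bigr].
\end{align*}
% Assumptions under which V(\pi) is identified from logged data:
%   Ignorability: Y_i(a) \perp A_i \mid X_i for every location a
%   Positivity:   e(\pi(x) \mid x) > 0 for every x in the support of X_i
%   SUTVA:        no interference and a single version of each placement
```

The AIPW form remains consistent if either the propensity model or the outcome regression is correctly specified, which is why agreement between IPW and AIPW is informative about model specification but says nothing about violations of ignorability.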

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: Identification of IPW requires... Ignorability: Y_i(a) ⊥ A_i | X_i ... Positivity... SUTVA

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] Acharya, A., Bansak, K., and Hainmueller, J. (2022). Combining outcome-based and preference-based matching: A constrained priority mechanism. Political Analysis, 30(1):89–112.

[2] Ahani, N., Andersson, T., Martinello, A., Teytelboym, A., and Trapp, A. C. (2021). Placement optimization in refugee resettlement. Operations Research, 69(5):1468–1486.

[3] Ahani, N., Gölz, P., Procaccia, A. D., Teytelboym, A., and Trapp, A. C. (2024). Dynamic placement in refugee resettlement. Operations Research, 72(3):1087–1104.

[4] Andrews, I., Kitagawa, T., and McCloskey, A. (2024). Inference on winners. The Quarterly Journal of Economics, 139(1):305–358.

[5] Weinstein, J. (2018). Improving refugee integration through data-driven algorithmic assignment. Science, 359(6373):325–329.

[6] Bansak, K., Lee, S., Manshadi, V., Niazadeh, R., and Paulson, E. (2026). Dynamic matching with post-allocation service and its application to refugee resettlement. Management Science (forthcoming).

[7] Bansak, K. and Paulson, E. (2024). Outcome-driven dynamic refugee assignment with allocation balancing. Operations Research, 72(6):2375–2390.

[8] Bansak, K., Paulson, E., and Rothenhäusler, D. (2024). Learning under random distributional shifts. In International Conference on Artificial Intelligence and Statistics, pages 3943–3951. PMLR.

[9] Bastani, H., Bastani, O., and McLaughlin, B. (2026). Winner’s curse drives false promises in data-driven decisions: A case study in refugee matching. arXiv preprint arXiv:2602.08892.

[10] Efron, B. (2011). Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614.

[11] Ferguson, J. P., Cho, J. H., Yang, C., and Zhao, H. (2013). Empirical Bayes correction for the winner’s curse in genetic association studies. Genetic Epidemiology, 37(1):60–68.

[12] Freund, D., Lykouris, T., Paulson, E., Sturt, B., and Weng, W. (2023). Group fairness in dynamic refugee assignment. arXiv preprint arXiv:2301.10642.
Gölz, P. and Procaccia, A. D. (2019). Migration as submodular optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 549–556.

[13] Jain, G., Rothenhäusler, D., Bansak, K., and Paulson, E. (2025). CTRL your shift: Clustered transfer residual learning for many small datasets. arXiv preprint arXiv:2508.11144.

[14] Rodriguez-Diaz, P., Bansak, K., and Paulson, E. (2025). A dual perspective on decision-focused learning: Scalable training via dual-guided surrogates. arXiv preprint arXiv:2511.04909.

[15] Zrnic, T. and Fithian, W. (2025). A flexible defense against the winner’s curse. The Annals of Statistics, 53(6):2516–2535.

A Appendix
A.1 Results from Data and Modeling Setup 1
This Data and Modeling Setup 1 constitutes the original setup and assignments from Bansak et al. (2018). Note that the Gains below are relative to the empirically observed emplo...
