Analysis of Linked Files: A Missing Data Perspective
Pith reviewed 2026-05-23 23:44 UTC · model grok-4.3
The pith
Record linkage can be treated as a missing data problem to correct biases from linkage errors in linked files.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that record linkage is best understood as a missing data problem, with linkage errors governed by mechanisms such as missing at random or missing not at random. This framing allows existing analysis methods to be grouped into likelihood and Bayesian approaches, imputation approaches, and weighting approaches according to how each handles the linkage mechanism. The paper delineates the assumptions each group requires and shows through simulations how performance depends on whether those assumptions match the true error process.
What carries the argument
Mapping linkage errors onto standard missing data mechanisms (MAR, MNAR) to classify and evaluate analysis methods.
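One way to write this mapping down uses the standard definitions from the missing data literature. The notation below is assumed for illustration and may differ from the paper's own symbols: $R$ stands for the linkage outcome, $D_{obs}$ for what the analyst observes, $D_{mis}$ for what linkage errors hide, and $\phi$ for the parameters of the linkage mechanism.

```latex
% Notation is assumed for illustration, not taken from the paper.
% R        : linkage indicator (which record pairs the algorithm links)
% D_{obs}  : data observed across the files
% D_{mis}  : values left unobserved because of missed or false links
% \phi     : parameters governing the linkage mechanism
\begin{align*}
\text{MAR:}  \quad & P(R \mid D_{obs}, D_{mis}, \phi) = P(R \mid D_{obs}, \phi)
                     \quad \text{for all } D_{mis}, \\
\text{MNAR:} \quad & P(R \mid D_{obs}, D_{mis}, \phi) \text{ depends on } D_{mis}
                     \text{ even after conditioning on } D_{obs}.
\end{align*}
```

Under the first condition the linkage mechanism can be ignored given a correct model for the observed data; under the second it has to be modeled explicitly.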
If this is right
- Ignoring linkage errors produces biased or overly precise estimates of associations.
- Methods fall into likelihood/Bayesian, imputation, or weighting categories depending on how they model the linkage mechanism.
- Valid inference requires explicit assumptions about whether linkage errors are missing at random or not at random.
- Simulation performance of each method varies with the true linkage error mechanism.
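The first consequence above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not from the paper: a simple linear model, a hypothetical function name `simulate_false_links`, and false links that occur completely at random by shuffling outcomes among a random subset of records. Under that specific mechanism, the naive regression slope shrinks toward zero by roughly the false-link rate.

```python
import numpy as np

def simulate_false_links(n=20000, beta=2.0, false_link_rate=0.2, seed=0):
    """Return (slope with perfect linkage, slope ignoring false links)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                 # covariate held in file A
    y = beta * x + rng.normal(size=n)      # outcome held in file B

    # Assumption: false links are unrelated to x and y. Choose the affected
    # records at random and shuffle their outcomes among themselves, so each
    # one is (almost always) paired with another entity's outcome.
    wrong = rng.random(n) < false_link_rate
    y_linked = y.copy()
    y_linked[wrong] = rng.permutation(y[wrong])

    slope_true = np.polyfit(x, y, 1)[0]          # analysis on correctly linked data
    slope_naive = np.polyfit(x, y_linked, 1)[0]  # analysis that ignores linkage errors
    return slope_true, slope_naive

slope_true, slope_naive = simulate_false_links()
print(f"perfect linkage: {slope_true:.2f}, ignoring errors: {slope_naive:.2f}")
```

With these settings the naive slope should land near `(1 - false_link_rate) * beta`, while the slope from correctly linked data stays near `beta`. Error mechanisms that depend on the data would not follow this simple pattern.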
Where Pith is reading between the lines
- The framework could guide development of software that jointly performs linkage and analysis while propagating uncertainty.
- Extensions might address linkage across more than two files or with time-varying records.
- Health and administrative data applications could adopt these methods to report credible intervals that include linkage uncertainty.
Load-bearing premise
The linkage error process can be fully characterized by standard missing-data mechanisms without residual dependence on the variables of interest that is not captured by the model.
What would settle it
A dataset or simulation in which linkage errors depend directly on the outcome variable in a way not captured by the assumed missingness mechanism, yet the proposed methods still eliminate bias.
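A toy version of such a check can be sketched as follows. It is an illustration under stated assumptions, not the paper's simulation design: only missed links are modeled (no false links), the analysis is an ordinary least squares fit on the linked records alone, and the function name `slope_among_linked` and the logistic coefficients are hypothetical.

```python
import numpy as np

def slope_among_linked(depends_on, n=50000, beta=1.0, seed=1):
    """Fit y ~ x using only the records that were successfully linked."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)

    # Assumption: the chance of linking a record follows a logistic curve in
    # either the covariate (captured by the analysis model) or the outcome
    # (residual dependence that the analysis model does not capture).
    driver = x if depends_on == "covariate" else y
    p_link = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * driver)))
    linked = rng.random(n) < p_link

    return np.polyfit(x[linked], y[linked], 1)[0]

slope_x = slope_among_linked("covariate")  # linkage depends on x only
slope_y = slope_among_linked("outcome")    # linkage depends directly on y
print(f"depends on covariate: {slope_x:.2f}, depends on outcome: {slope_y:.2f}")
```

When linkage depends only on the covariate, the slope among linked records stays close to `beta`; when it depends directly on the outcome, the slope is attenuated. A method that removed the bias in the second regime without modeling that dependence would count against the premise.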
Original abstract
In many applications, researchers seek to identify overlapping entities across multiple data files. Record linkage algorithms facilitate this task, in the absence of unique identifiers. As these algorithms rely on semi-identifying information, they may miss records that represent the same entity, or incorrectly link records that do not represent the same entity. Analysis of linked files commonly ignores such linkage errors, resulting in biased, or overly precise estimates of the associations of interest. We view record linkage as a missing data problem, and delineate the linkage mechanisms that underpin analysis methods with linked files. Following the missing data literature, we group these methods under three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. We summarize the assumptions and limitations of the methods, and evaluate their performance in a wide range of simulation scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames record linkage as a missing-data problem, delineates the underlying linkage mechanisms (under MAR/MNAR and related categories), groups existing analysis methods into likelihood/Bayesian, imputation, and weighting classes, summarizes their assumptions and limitations, and evaluates performance via simulations across a wide range of scenarios.
Significance. If the central framing holds, the paper supplies a coherent synthesis that lets practitioners import standard missing-data tools to linked-file analyses, potentially reducing bias from ignored linkage errors. The simulation component is load-bearing for demonstrating when the three method classes succeed or fail.
Major comments (2)
- [Simulation study (as described in abstract and methods)] The central claim requires that linkage-error dependence on substantive variables is fully captured by the observed data used to define the missingness mechanism. If linkage probability depends directly on an analysis variable (e.g., the outcome) orthogonal to the covariates entering the linkage model, the ignorability conditions fail and the grouped methods inherit the usual missing-data bias. The simulation evaluation must therefore include explicit residual-dependence regimes; absent that, the scope of the perspective remains conditional on an untested modeling assumption.
- [Abstract and simulation section] Abstract states that the simulation study evaluates performance 'in a wide range of simulation scenarios' yet supplies no information on design, sample sizes, error metrics, or whether residual-dependence cases were examined. This detail is necessary to assess whether the reported limitations of the three method classes are supported by the evidence.
Simulated Author's Rebuttal
We thank the referee for these detailed and constructive comments on the simulation study and its description. We address each point below and outline revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Simulation study (as described in abstract and methods)] The central claim requires that linkage-error dependence on substantive variables is fully captured by the observed data used to define the missingness mechanism. If linkage probability depends directly on an analysis variable (e.g., the outcome) orthogonal to the covariates entering the linkage model, the ignorability conditions fail and the grouped methods inherit the usual missing-data bias. The simulation evaluation must therefore include explicit residual-dependence regimes; absent that, the scope of the perspective remains conditional on an untested modeling assumption.
  Authors: We agree that direct dependence of linkage probability on the outcome (orthogonal to observed covariates) represents an MNAR mechanism outside standard ignorability assumptions, and that the grouped methods would then inherit bias. The manuscript already delineates MNAR linkage mechanisms and their implications for each method class in the assumptions and limitations sections. However, the original simulations did not explicitly include such residual-dependence regimes. To address this, we will expand the simulation study to incorporate these cases and report the resulting performance of the three method classes.
  Revision: yes
- Referee: [Abstract and simulation section] Abstract states that the simulation study evaluates performance 'in a wide range of simulation scenarios' yet supplies no information on design, sample sizes, error metrics, or whether residual-dependence cases were examined. This detail is necessary to assess whether the reported limitations of the three method classes are supported by the evidence.
  Authors: The abstract is intentionally concise and does not contain full methodological details, which is standard. The simulation section of the manuscript does describe the overall design and scenarios, but we acknowledge that it lacks explicit reporting of sample sizes, error metrics, and confirmation regarding residual-dependence cases. We will revise the simulation section to provide these specifics, including a clear statement that residual-dependence regimes were not part of the original design but will be added in the revision.
  Revision: yes
Circularity Check
No significant circularity; perspective applies external missing-data framework
Full rationale
The paper frames record linkage as a missing-data problem and groups existing methods into three standard categories (likelihood/Bayesian, imputation, weighting) drawn from the missing-data literature. No derivation chain, equation, or central claim reduces by construction to the authors' own fitted parameters, self-citations, or ansatzes. Simulations evaluate performance across scenarios but do not create self-referential predictions. The delineation relies on standard MAR/MNAR mechanisms without internal self-definition or load-bearing self-citation. This is the expected finding for a review-and-simulation paper whose contribution is organizational rather than a closed mathematical derivation.