Analysis of Linked Files: A Missing Data Perspective

Gauri Kamat; Roee Gutman

arxiv: 2406.14717 · v3 · submitted 2024-06-20 · 📊 stat.ME · stat.AP

Pith reviewed 2026-05-23 23:44 UTC · model grok-4.3

keywords: record linkage, missing data mechanisms, linked files, imputation methods, weighting methods, likelihood methods, simulation evaluation

The pith

Record linkage can be treated as a missing data problem to correct biases from linkage errors in linked files.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that record linkage should be viewed as a missing data problem, in which linkage errors (missed links and false links) correspond to standard missingness mechanisms. This perspective organizes analysis methods for linked files into three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. A sympathetic reader would care because analyses that ignore these errors commonly produce biased or overly precise estimates of associations. The work summarizes the assumptions and limitations of each category and evaluates their performance across a range of simulation scenarios.

Core claim

The paper claims that record linkage is best understood as a missing data problem, with linkage errors governed by mechanisms such as missing at random or missing not at random. This framing allows existing analysis methods to be grouped into likelihood and Bayesian approaches, imputation approaches, and weighting approaches according to how each handles the linkage mechanism. The paper delineates the assumptions each group requires and shows through simulations how performance depends on whether those assumptions match the true error process.

What carries the argument

Mapping linkage errors onto standard missing data mechanisms (MAR, MNAR) to classify and evaluate analysis methods.
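
For reference, the standard definitions from the missing-data literature can be written as below. The notation is an assumption made for illustration and is not necessarily the paper's: $R$ is an indicator pattern (here, hypothetically, which records are linked), $Y_{\mathrm{obs}}$ and $Y_{\mathrm{mis}}$ are the observed and unobserved parts of the data, and $\phi$ denotes the parameters of the mechanism.

```latex
% MAR: the mechanism depends on the data only through the observed part.
\Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi)
  = \Pr(R \mid Y_{\mathrm{obs}}, \phi)
  \quad \text{for all } Y_{\mathrm{mis}} .

% MNAR: the equality above fails, i.e. R still depends on Y_mis
% after conditioning on Y_obs.
```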

If this is right

Ignoring linkage errors produces biased or overly precise estimates of associations.
Methods fall into likelihood/Bayesian, imputation, or weighting categories depending on how they model the linkage mechanism.
Valid inference requires explicit assumptions about whether linkage errors are missing at random or not at random.
Simulation performance of each method varies with the true linkage error mechanism.
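
The first point in the list above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's simulation design: the function name, sample size, slope, and error rate are all hypothetical, and false links are modeled by shuffling the outcome among a random subset of records, so that a mislinked outcome is unrelated to the covariate.

```python
import numpy as np

def naive_slope_with_false_links(n=20000, beta=2.0, error_rate=0.3, seed=0):
    """OLS slope of y on x after a random fraction of links is scrambled.

    Hypothetical setup: x lives in one file, y in the other, and a false
    link pairs x with the y value of a different record.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    # Mark records with a wrong link and shuffle y among those records only.
    wrong = rng.random(n) < error_rate
    y_linked = y.copy()
    y_linked[wrong] = rng.permutation(y[wrong])
    # Simple-regression slope: cov(x, y) / var(x).
    return np.cov(x, y_linked)[0, 1] / np.var(x, ddof=1)

slope_clean = naive_slope_with_false_links(error_rate=0.0)
slope_linked = naive_slope_with_false_links(error_rate=0.3)
print(f"no linkage error: {slope_clean:.2f}")
print(f"30% false links:  {slope_linked:.2f}")
```

Under this simple error model the naive slope shrinks toward zero by roughly the factor (1 - error_rate), because the scrambled records contribute no covariance between x and y.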

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The framework could guide development of software that jointly performs linkage and analysis while propagating uncertainty.
Extensions might address linkage across more than two files or with time-varying records.
Health and administrative data applications could adopt these methods to report credible intervals that include linkage uncertainty.

Load-bearing premise

The linkage error process can be fully characterized by standard missing-data mechanisms without residual dependence on the variables of interest that is not captured by the model.

What would settle it

A dataset or simulation in which linkage errors depend directly on the outcome variable in a way not captured by the assumed missingness mechanism, yet the proposed methods still eliminate bias.
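
A simulation of the kind described above might start from a contrast like the following. It is a sketch under assumed settings (the logistic link probability, the sample size, and the function name are hypothetical): records are linked with a probability that depends either on the covariate x or directly on the outcome y, and a regression is then fitted to the linked records only.

```python
import numpy as np

def linked_only_slope(depends_on, n=50000, beta=1.0, seed=1):
    """OLS slope of y on x using only the records that were linked.

    depends_on="x": link probability depends on the covariate only.
    depends_on="y": link probability depends directly on the outcome.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    driver = x if depends_on == "x" else y
    # Hypothetical logistic link probability: larger driver, more likely linked.
    p_link = 1.0 / (1.0 + np.exp(-driver))
    linked = rng.random(n) < p_link
    xs, ys = x[linked], y[linked]
    return np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)

slope_mar = linked_only_slope("x")
slope_mnar = linked_only_slope("y")
print(f"linkage depends on x: {slope_mar:.2f}")
print(f"linkage depends on y: {slope_mnar:.2f}")
```

When linkage depends only on x, the linked-only slope stays close to the true value; when it depends on y, the slope is attenuated. A method that assumes the first mechanism has no information in the linked records alone to detect the second.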

Original abstract

In many applications, researchers seek to identify overlapping entities across multiple data files. Record linkage algorithms facilitate this task, in the absence of unique identifiers. As these algorithms rely on semi-identifying information, they may miss records that represent the same entity, or incorrectly link records that do not represent the same entity. Analysis of linked files commonly ignores such linkage errors, resulting in biased, or overly precise estimates of the associations of interest. We view record linkage as a missing data problem, and delineate the linkage mechanisms that underpin analysis methods with linked files. Following the missing data literature, we group these methods under three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. We summarize the assumptions and limitations of the methods, and evaluate their performance in a wide range of simulation scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A review that organizes linkage methods under missing-data categories and adds simulation comparisons, but the framing assumes linkage errors fit standard mechanisms without unmodeled dependence on analysis variables.

This paper treats record linkage as a missing-data problem and sorts existing analysis approaches into likelihood/Bayesian, imputation, and weighting groups based on the linkage mechanism. It then runs simulations to compare them across scenarios and summarizes the assumptions each group carries. That organization and the simulation benchmarks are the main things it offers. The taxonomy itself largely follows standard missing-data distinctions already in the literature, so the novelty sits mainly in the application to linkage and the side-by-side evaluation.

The simulations appear to cover a range of settings and the paper flags limitations for each method class, which is useful for practitioners who need a quick map of the options.

The central assumption is that any dependence between linkage errors and the variables of interest can be captured by the observed data used to define the missingness mechanism. If linkage probability depends directly on an outcome or other analysis variable in ways orthogonal to the linkage covariates, the ignorability conditions break and the grouped methods inherit bias. The stress-test note flags exactly this point, and the abstract does not show whether the simulations include such residual-dependence regimes. Without that coverage the practical scope of the perspective stays conditional.

The work is aimed at statisticians who analyze linked files and want a structured overview rather than new estimators. It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee even if the simulations need expansion on the dependence cases.

Referee Report

2 major / 0 minor

Summary. The manuscript frames record linkage as a missing-data problem, delineates the underlying linkage mechanisms (under MAR/MNAR and related categories), groups existing analysis methods into likelihood/Bayesian, imputation, and weighting classes, summarizes their assumptions and limitations, and evaluates performance via simulations across a wide range of scenarios.

Significance. If the central framing holds, the paper supplies a coherent synthesis that lets practitioners import standard missing-data tools to linked-file analyses, potentially reducing bias from ignored linkage errors. The simulation component is load-bearing for demonstrating when the three method classes succeed or fail.
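
As one illustration of importing a standard missing-data tool, the sketch below applies inverse probability weighting to missed links. Everything in it is an assumption for illustration rather than a method taken from the paper: a binary covariate x is taken to be available for every record in the primary file, the outcome y is available only for linked records, and the link rate depends on x alone.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40000

# Hypothetical population: the true mean of y is 1 + 2 * 0.5 = 2.
x = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Missed links depend on the observed covariate only (an MAR-type mechanism).
p_link = np.where(x == 1, 0.5, 0.9)
linked = rng.random(n) < p_link

# Naive estimate: average y over linked records, ignoring missed links.
naive_mean = y[linked].mean()

# Weighting: estimate the link rate within each level of x, then weight
# each linked record by the inverse of its estimated link rate.
rate_hat = np.array([linked[x == g].mean() for g in (0, 1)])
weights = 1.0 / rate_hat[x[linked]]
ipw_mean = np.sum(weights * y[linked]) / np.sum(weights)

print(f"naive mean:    {naive_mean:.2f}")
print(f"weighted mean: {ipw_mean:.2f}")
```

The weighted estimate recovers the population mean here only because the link rate depends on a fully observed covariate; if it also depended on y, the same weights would not remove the bias.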

major comments (2)

[Simulation study (as described in abstract and methods)] The central claim requires that linkage-error dependence on substantive variables is fully captured by the observed data used to define the missingness mechanism. If linkage probability depends directly on an analysis variable (e.g., the outcome) orthogonal to the covariates entering the linkage model, the ignorability conditions fail and the grouped methods inherit the usual missing-data bias. The simulation evaluation must therefore include explicit residual-dependence regimes; absent that, the scope of the perspective remains conditional on an untested modeling assumption.
[Abstract and simulation section] The abstract states that the simulation study evaluates performance 'in a wide range of simulation scenarios' yet supplies no information on design, sample sizes, error metrics, or whether residual-dependence cases were examined. This detail is necessary to assess whether the reported limitations of the three method classes are supported by the evidence.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for these detailed and constructive comments on the simulation study and its description. We address each point below and outline revisions to strengthen the manuscript.

Referee: [Simulation study (as described in abstract and methods)] The central claim requires that linkage-error dependence on substantive variables is fully captured by the observed data used to define the missingness mechanism. If linkage probability depends directly on an analysis variable (e.g., the outcome) orthogonal to the covariates entering the linkage model, the ignorability conditions fail and the grouped methods inherit the usual missing-data bias. The simulation evaluation must therefore include explicit residual-dependence regimes; absent that, the scope of the perspective remains conditional on an untested modeling assumption.

Authors: We agree that direct dependence of linkage probability on the outcome (orthogonal to observed covariates) represents an MNAR mechanism outside standard ignorability assumptions, and that the grouped methods would then inherit bias. The manuscript already delineates MNAR linkage mechanisms and their implications for each method class in the assumptions and limitations sections. However, the original simulations did not explicitly include such residual-dependence regimes. To address this, we will expand the simulation study to incorporate these cases and report the resulting performance of the three method classes.

Revision: yes

Referee: [Abstract and simulation section] The abstract states that the simulation study evaluates performance 'in a wide range of simulation scenarios' yet supplies no information on design, sample sizes, error metrics, or whether residual-dependence cases were examined. This detail is necessary to assess whether the reported limitations of the three method classes are supported by the evidence.

Authors: The abstract is intentionally concise and does not contain full methodological details, which is standard. The simulation section of the manuscript does describe the overall design and scenarios, but we acknowledge that it lacks explicit reporting of sample sizes, error metrics, and confirmation regarding residual-dependence cases. We will revise the simulation section to provide these specifics, including a clear statement that residual-dependence regimes were not part of the original design but will be added in the revision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; perspective applies external missing-data framework

The paper frames record linkage as a missing-data problem and groups existing methods into three standard categories (likelihood/Bayesian, imputation, weighting) drawn from the missing-data literature. No derivation chain, equation, or central claim reduces by construction to the authors' own fitted parameters, self-citations, or ansatzes. Simulations evaluate performance across scenarios but do not create self-referential predictions. The delineation relies on standard MAR/MNAR mechanisms without internal self-definition or load-bearing self-citation. This is the expected finding for a review-and-simulation paper whose contribution is organizational rather than a closed mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a review paper that organizes existing methods; it introduces no new free parameters, axioms, or invented entities beyond standard missing-data assumptions already present in the cited literature.

[1] [1]

M., Abramowitz, J., Levenstein, M

Abowd, J. M., Abramowitz, J., Levenstein, M. C., McCue, K., Patki, D., Raghunathan, T. E., Rodgers, A. M., Shapiro, M. D., Wasi, N., and Zinsser, D. (2021). Finding needles in haystacks: Multiple-imputation record linkage using machine learning. Working Paper 21-35, Center for Economic Studies, U.S. Census Bureau

work page 2021

[2] [2]

and Sadinle, M

Aleshin-Guendel, S. and Sadinle, M. (2022). Multifile partitioning for record linkage and duplicate detection. Journal of the American Statistical Association , 0(0):1--10

work page 2022

[3] [3]

Asher, J., Resnick, D., Brite, J., Brackbill, R., and Cone, J. (2020). An introduction to probabilistic record linkage with a focus on linkage processing for wtc registries. International Journal of Environmental Research and Public Health , 17(18):6937

work page 2020

[4] [4]

and Christen, P

Baxter, R. and Christen, P. (2003). A comparison of fast blocking methods for record linkage, cmis technical report 03/139. In Proceedings of ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation , pages 39--48

work page 2003

[5] [5]

Belin, T. R. and Rubin, D. B. (1995). A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association , 90(430):694–707

work page 1995

[6] [6]

Bilenko, M., Kamath, B., and Mooney, R. (2006). Adaptive blocking: Learning to scale up record linkage. In Proceedings of the Sixth IEEE International Conference on Data Mining , pages 87--96

work page 2006

[7] [7]

and Mooney, R

Bilenko, M. and Mooney, R. J. (2003). On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation , pages 7--12

work page 2003

[8] [8]

and Steorts, R

Binette, O. and Steorts, R. C. (2022). (almost) all of entity resolution. Science Advances , 8

work page 2022

[9] [9]

Bird, S. M. and King, R. (2018). Multiple systems estimation (or capture-recapture estimation) to inform public policy. Annual Review of Statistics and its Application , 5:95--118

work page 2018

[10] [10]

Bohensky, M. (2015). Bias in data linkage studies , chapter 4, pages 63--82. John Wiley and Sons, Ltd

work page 2015

[11] [11]

Brenner, H., Schmidtmann, I., and Stegmaier, C. (1997). Effects of record linkage errors on registry-based follow-up studies. Statistics in Medicine , 16(23):2633--2643

work page 1997

[12] [12]

Briscolini, D., Di Consiglio, L., Liseo, B., Tancredi, A., and Tuoto, T. (2018). New methods for small area estimation with linkage uncertainty. International Journal of Approximate Reasoning , 94:30--42

work page 2018

[13]

Campbell, K., Deck, D., and Krupski, A. (2008). Record linkage software in the public domain: A comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm. Health Informatics Journal, 14:5–15.

[14]

Campbell, S. R., Resnick, D. M., Cox, C. S., and Mirel, L. B. (2021). Using supervised machine learning to identify efficient blocking schemes for record linkage. Statistical Journal of the IAOS, 37(2):673–680.

[15]

Cangul, M. Z., Chretien, Y. R., Gutman, R., and Rubin, D. B. (2009). Testing treatment effects in unconfounded studies under model misspecification: Logistic regression, discretization, and their combination. Statistics in Medicine, 28(20):2531–2551.

[16]

Chambers, R. (2009). Regression analysis of probability-linked data. Statisphere Official Statistics, 4.

[17]

Chambers, R. and Diniz da Silva, A. (2020). Improved secondary analysis of linked data: A framework and an illustration. Journal of the Royal Statistical Society, Series A, 183:37–59.

[18]

Chambers, R., Salvati, N., Fabrizi, E., and Diniz da Silva, A. (2019). Domain estimation under informative linkage. Statistical Theory and Related Fields, 3(2):90–102.

[19]

Chambers, R. L., Fabrizi, E., Ranalli, M. G., Salvati, N., and Wang, S. (2022). Robust regression using probabilistically linked data. WIREs Computational Statistics, page e1596.

[20]

Chipperfield, J. (2019). A weighting approach to making inference with probabilistically linked data. Statistica Neerlandica, 73(3):333–350.

[21]

Chipperfield, J. O. and Chambers, R. L. (2015). Using the bootstrap to account for linkage errors when analysing probabilistically linked categorical data. Journal of Official Statistics, 31(3):397.

[22]

Christen, P. (2007). A two-step classification approach to unsupervised record linkage. In Proceedings of the Sixth Australasian Conference on Data Mining and Analytics, volume 70, pages 111–119.

[23]

Christen, P. (2008a). Automatic record linkage using seeded nearest neighbour and support vector machine classification. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 151–159.

[24]

Christen, P. (2008b). Automatic training example selection for scalable unsupervised record linkage. In Washio, T., Suzuki, E., Ting, K. M., and Inokuchi, A., editors, Advances in Knowledge Discovery and Data Mining: 12th Pacific-Asia Conference, PAKDD, pages 511–518.

[25]

Christen, P. and Goiser, K. (2005). Assessing deduplication and data linkage quality: What to measure? In Proceedings of the Fourth Australasian Data Mining Conference.

[26]

Christen, P. and Goiser, K. (2007). Quality and complexity measures for data linkage and deduplication. In Quality Measures in Data Mining, pages 127–151. Springer.

[27]

Christen, P., Ranbaduge, T., and Schnell, R. (2020). Linking Sensitive Data: Methods and Techniques for Practical Privacy-Preserving Information Sharing. Springer, Cham.

[28]

Cochinwala, M., Kurien, V., Lalk, G., and Shasha, D. (2001). Efficient data reconciliation. Information Sciences, 137(1):1–15.

[29]

Collins, L. M., Schafer, J. L., and Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6:330–351.

[30]

Conti, G., Frühwirth-Schnatter, S., Heckman, J. J., and Piatek, R. (2014). Bayesian exploratory factor analysis. Journal of Econometrics, 183(1):31–57.

[31]

Cook, L. J., Olson, L. M., and Dean, J. M. (2001). Probabilistic Record Linkage: Relationships between File Sizes, Identifiers, and Match Weights. Methods of Information in Medicine, 40:196–203.

[32]

Copas, J. B. and Hilton, F. J. (1990). Record linkage: Statistical models for matching computer records. Journal of the Royal Statistical Society. Series A (Statistics in Society), 153(3):287–320.

[33]

Daggy, J., Xu, H., Hui, S., and Grannis, S. (2014). Evaluating latent class models with conditional dependence in record linkage. Statistics in Medicine, 33(24):4250–4265.

[34]

Dalzell, N. M. and Reiter, J. P. (2018). Regression modeling and file matching using possibly erroneous matching variables. Journal of Computational and Graphical Statistics, 27(4):728–738.

[35]

Dasylva, A., Titus, R.-C., and Thibault, C. (2014). Overcoverage in the 2011 Canadian census. In Proceedings of Statistics Canada Symposium.

[36]

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

[37]

Di Consiglio, L. and Tuoto, T. (2015). Coverage evaluation on probabilistically linked data. Journal of Official Statistics, 31:415–429.

[38]

Di Consiglio, L. and Tuoto, T. (2018). Population size estimation and linkage errors: The multiple lists case. Journal of Official Statistics, 34:889–908.

[39]

Doidge, J. C. and Harron, K. (2018). Demystifying probabilistic linkage: Common myths and misconceptions. International Journal of Population Data Science, 3:410(1).

[40]

D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. Hoboken, NJ: Wiley.

[41]

Dusetzina, S. B., Tyree, S., Meyer, A.-M., Meyer, A., Green, L., and Carpenter, W. R. (2014). Linking data for health services research: A framework and instructional guide. Rockville, MD: Agency for Healthcare Research and Quality (US).

[42]

Elfeky, M. G., Verykios, V. S., Elmagarmid, A. K., Ghanem, T. M., and Kuwait, W. A. R. (2003). Record linkage: A machine learning approach, a toolbox, and a digital government web service. Purdue e-Pubs. Purdue University, West Lafayette.

[43]

Enamorado, T., Fifield, B., and Imai, K. (2019). Using a probabilistic model to assist merging of large-scale administrative records. American Political Science Review, 113:353–371.

[44]

Fellegi, I. P. and Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64:1183–1210.

[45]

Fienberg, S. E. (1972). The multiple recapture census for closed populations and incomplete 2^k contingency tables. Biometrika, 59:591–603.

[46]

Fienberg, S. E. and Manrique-Vallier, D. (2009). Integrated methodology for multiple systems estimation and record linkage using a missing data formulation. AStA Advances in Statistical Analysis, 93(1):49–60.

[47]

Fisher, R. A. and Yates, F. (1963). Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd: Edinburgh, UK.

[48]

Fortini, M., Liseo, B., Nuccitelli, A., and Scanu, M. (2001). On Bayesian record linkage. Research in Official Statistics, 4:185–198.

[49]

Freedman, D. A. (2006). On the so-called “Huber sandwich estimator” and “robust standard errors”. The American Statistician, 60(4):299–302.

[50]

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. CRC Press.

[51]

Gilbert, R., Lafferty, R., Hagger-Johnson, G., Harron, K., Zhang, L. C., Smith, P., Dibben, C., and Goldstein, H. (2017). GUILD: Guidance for information about linking data sets. Journal of Public Health, 40:191–198.

[52]

Golden, C. and Mirel, L. B. (2021). Enhancement of health surveys with data linkage. In Chun, A. Y., Larsen, M., Durrant, G., and Reiter, J. P., editors, Administrative Records for Survey Methodology, pages 105–138. Wiley.

[53]

Goldstein, H., Carpenter, J., Kenward, M. G., and Levin, K. A. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9(3):173–197.

[54]

Goldstein, H. and Harron, K. (2015). Record linkage: A missing data problem. In Harron, K., Goldstein, H., and Dibben, C., editors, Methodological Developments in Data Linkage, volume 1, pages 109–124. John Wiley & Sons.

[55]

Goldstein, H., Harron, K., and Wade, A. (2012). The analysis of record-linked data using multiple imputation with data value priors. Statistics in Medicine, 31(28):3481–3493.

[56]

Gomatam, S., Carter, R., Ariet, M., and Mitchell, G. (2002). An empirical comparison of record linkage procedures. Statistics in Medicine, 21:1485–1496.

[57]

Green, P. J. and Mardia, K. V. (2006). Bayesian alignment using hierarchical models, with applications in protein bioinformatics. Biometrika, 93(2):235–254.

[58]

Gu, L. and Baxter, R. (2006). Decision models for record linkage. In Williams, G. and Simoff, S., editors, Data Mining, Lecture Notes in Computer Science, pages 146–160. Springer, Berlin, Heidelberg.

[59]

Gutman, R., Afendulis, C. C., and Zaslavsky, A. M. (2013). A Bayesian procedure for file linking to analyze end-of-life medical costs. Journal of the American Statistical Association, 108(501):34–47.

[60]

Gutman, R., Sammartino, C., Green, T., and Montague, B. (2016). Error adjustments for file linking methods using encrypted unique client identifier (eUCI) with application to recently released prisoners who are HIV+. Statistics in Medicine, 35(1):115–129.

[61]

Haas, J. S., Brandenburg, J. A., Udvarhelyi, I. S., and Epstein, A. M. (1994). Creating a comprehensive database to evaluate health coverage for pregnant women: The completeness and validity of a computerized linkage algorithm. Medical Care, 32(10):1053–1057.

[62]

Hall, R. and Fienberg, S. (2012). Valid statistical inference on automatically matched files. In Domingo-Ferrer, J. and Muralidhar, K., editors, Proceedings of the International Conference on Privacy in Statistical Databases, pages 131–142.

[63]

Han, Y. (2018). Statistical Inference Using Data From Multiple Files Combined Through Record Linkage. PhD thesis, University of Maryland.

[64]

Han, Y. and Lahiri, P. (2019). Statistical analysis with linked data. International Statistical Review, 87:1013–1038.

[65]

Harron, K., Dibben, C., Boyd, J., Hjern, A., Azimaee, M., Barreto, M. L., and Goldstein, H. (2017). Challenges in administrative data linkage for research. Big Data & Society, 4(2).

[66]

Harron, K., Wade, A., Gilbert, R., Muller-Pebody, B., and Goldstein, H. (2014). Evaluating bias due to data linkage error in electronic healthcare records. BMC Medical Research Methodology, 14(1):1–10.

[67]

Hof, M. H. P., Ravelli, A. C., and Zwinderman, A. H. (2017). A probabilistic record linkage model for survival data. Journal of the American Statistical Association, 112(520):1504–1515.

[68]

Hof, M. H. P. and Zwinderman, A. H. (2012). Methods for analyzing data from probabilistic linkage strategies based on partially identifying variables. Statistics in Medicine, 31(30):4231–4242.

[69]

Hof, M. H. P. and Zwinderman, A. H. (2014). A mixture model for the analysis of data derived from record linkage. Statistics in Medicine, 34(1):74–92.

[70]

Isaki, C. and Schultz, L. (1987). The effects of correlation and matching error on dual system estimation. Communications in Statistics - Theory and Methods, 16:2405–2427.

[71]

Jaro, M. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84:414–420.

[72]

Jiang, J., Lahiri, P., and Wan, S.-M. (2002). A unified jackknife theory for empirical best prediction with M-estimation. The Annals of Statistics, 30:1782–1810.

[73]

Kamat, G., Shan, M., and Gutman, R. (2023). Bayesian record linkage with variables in one file. Statistics in Medicine, 42:4931–4951.

[74]

Kim, G. and Chambers, R. (2012a). Regression analysis under incomplete linkage. Computational Statistics and Data Analysis, 56(518):2756–2770.

[75]

Kim, G. and Chambers, R. (2012b). Regression analysis under probabilistic multi-linkage. Statistica Neerlandica, 66:64–79.

[76]

Kim, G. and Chambers, R. (2015). Unbiased regression estimation under correlated linkage errors. Stat, 4(1):32–45.

[77]

Krewski, D., Dewanji, A., Wang, Y., Bartlett, S., Zielinski, J., and Mallick, R. (2005). The effect of record linkage errors on risk estimates in cohort mortality studies. Survey Methodology, 31:13–21.

[78]

Lahiri, P. and Larsen, M. D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100(469):222–230.

[79]

Lariscy, J. T. (2011). Differential record linkage by Hispanic ethnicity and age in linked mortality studies: Implications for the epidemiologic paradox. Journal of Aging and Health, 23(8):1263–1284.

[80]

Larsen, M. D. (2002). Comments on hierarchical Bayesian record linkage. In Proceedings of the Survey Methods Section, pages 1995–2000. American Statistical Association.