A Comparison of Record Linkage Techniques

Lowell G. Mason

Abstract

It has become increasingly common to create new statistical products by integrating existing data rather than engaging in new data collection; using existing data sources is less expensive and does not increase respondent burden. However, it is usually not possible to integrate multiple data sources satisfactorily without manual intervention. An example is the integration of the Bureau of Economic Analysis (BEA) enterprise-level data on Foreign Direct Investment (FDI) with establishment data from the Bureau of Labor Statistics' Quarterly Census of Employment and Wages (QCEW). In this particular case, the initial error rate was 87.7%. After manual review and correction, the error rate was reduced to 19.0%. The labor cost, however, was considerable: almost 1,510.5 hours. To reduce linkage error and labor costs, we implement several record linkage techniques. We consider supervised learning techniques, such as Support Vector Machines (SVM) and Random Forests. As a baseline for comparison, we also implement the method developed by Fellegi and Sunter (1969).