An Automated Record Linkage System for the Canadian Census, 1871-81


The goal of the project is to create longitudinal data by linking persons across censuses from 1851 to 1911. Researchers would use these data to investigate historical trends and to address questions about society, history and the economy; such comparative, systematic research would not be possible without the linked data. Automated record linkage is not a new field, and our project builds on systems that have been studied for several decades. Typical record linkage systems are probabilistic and based on early contributions (Newcombe '59, '62; Fellegi and Sunter '69), but advances in machine learning, data mining and statistics have driven the development of new record linkage systems with better performance (Winkler '06, Elmagarmid et al. '07). However, major issues remain. First, these systems rely on training data (pre-labeled pairs of records), which are expensive to acquire and limited at best when available. Second, these systems are computationally expensive and do not scale well to millions of records. Third, there is no standard benchmark against which to compare them. We develop a record linkage system that incorporates a supervised learning module for classifying pairs of entities as matches or non-matches. This paper presents the system developed to link persons from the 1871 Canadian census to the 1881 Canadian census. We describe the methodology used to design and develop the system, the data preparation needed to handle data heterogeneity, the choice of similarity measures for comparing pairs of records, and the classification system along with the pre-labeled pairs of records used in the classification process.
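
To make the pair-comparison step concrete, the following is a minimal sketch of how two census records might be turned into a vector of similarity scores before classification. It is an illustration only: the field names (`first_name`, `last_name`, `age`, `birthplace`), the use of Python's `difflib.SequenceMatcher` as the string similarity measure, and the two-year age tolerance are all assumptions made for this example, not the measures or attributes chosen in the system described here.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def age_consistency(age_1871: int, age_1881: int, tolerance: int = 2) -> float:
    """1.0 if the 1881 age is within `tolerance` years of the 1871 age plus 10."""
    expected = age_1871 + 10
    return 1.0 if abs(age_1881 - expected) <= tolerance else 0.0


def compare_pair(rec_1871: dict, rec_1881: dict) -> list:
    """Build a similarity vector for one candidate pair of records.

    The field names are hypothetical; a real system would use whatever
    attributes survive data preparation.
    """
    return [
        name_similarity(rec_1871["first_name"], rec_1881["first_name"]),
        name_similarity(rec_1871["last_name"], rec_1881["last_name"]),
        age_consistency(rec_1871["age"], rec_1881["age"]),
        1.0 if rec_1871["birthplace"] == rec_1881["birthplace"] else 0.0,
    ]


if __name__ == "__main__":
    a = {"first_name": "John", "last_name": "Smith", "age": 25, "birthplace": "Ontario"}
    b = {"first_name": "Jon", "last_name": "Smith", "age": 36, "birthplace": "Ontario"}
    print(compare_pair(a, b))
```

In a supervised setting, vectors like these, computed for the pre-labeled pairs, would serve as training examples for a classifier that labels each candidate pair as a match or a non-match.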