An Automated Record Linkage System for the Canadian Census, 1871-81
The goal of the project is to create longitudinal data linking persons across
censuses from 1851 to 1911. This longitudinal data would be used by
researchers to investigate historical trends and to address questions
about society, history and economy; this comparative, systematic research
would not be possible without the linked data. Automated record linkage is
not a new field, and our project builds on systems that have been studied over
the past several decades. Typical record linkage systems are probabilistic, based
on early contributions (Newcombe '59, '62; Fellegi and Sunter '69), but advances
in machine learning, data mining and statistics have driven the development of
new record linkage systems with better performance (Winkler '06, Elmagarmid et
al. '07). However, there are still major issues that have to be addressed.
First, these systems rely on the existence of training data (pre-labeled pairs
of records), which is expensive to acquire and limited at best when available.
Second, these systems are computationally expensive and do not scale well to
millions of records. Third, there is no standard benchmark to compare these
systems against. We develop a record linkage system that incorporates a
supervised learning module for classifying pairs of records as matches or
non-matches. This paper presents the system developed to link persons from the
1871 Canadian census to the 1881 Canadian census. We describe the methodology
used to design and develop the system, the data preparation needed to handle
heterogeneity in the data, the choice of similarity measures used to compare
pairs of records, and the classification system, along with the pre-labeled
pairs of records used in the classification process.
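
To make the idea of comparing pairs of records concrete, the following is a minimal sketch of how a pair of census records might be turned into a vector of similarity scores. It is not the system described in this paper: the field names (first_name, last_name, age, sex), the use of Python's difflib for string similarity, the age tolerance, and the fixed threshold in is_match are all assumptions made for illustration. In particular, the threshold rule only stands in for the supervised classifier, which would instead be trained on the pre-labeled pairs.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two name strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def age_consistency(age_1871: int, age_1881: int, tolerance: int = 2) -> float:
    """1.0 when the person aged exactly 10 years between censuses,
    decreasing linearly and reaching 0.0 once the error exceeds the tolerance."""
    error = abs((age_1881 - age_1871) - 10)
    return max(0.0, 1.0 - error / (tolerance + 1))


def compare(rec_1871: dict, rec_1881: dict) -> list[float]:
    """Build a similarity vector for one candidate pair (hypothetical fields)."""
    return [
        name_similarity(rec_1871["first_name"], rec_1881["first_name"]),
        name_similarity(rec_1871["last_name"], rec_1881["last_name"]),
        age_consistency(rec_1871["age"], rec_1881["age"]),
        1.0 if rec_1871["sex"] == rec_1881["sex"] else 0.0,
    ]


def is_match(features: list[float], threshold: float = 0.85) -> bool:
    """Illustrative stand-in for a trained classifier: average the scores
    and compare against an assumed threshold."""
    return sum(features) / len(features) >= threshold


if __name__ == "__main__":
    a = {"first_name": "John", "last_name": "Macdonald", "age": 25, "sex": "M"}
    b = {"first_name": "john", "last_name": "McDonald", "age": 36, "sex": "M"}
    print(compare(a, b), is_match(compare(a, b)))
```

In a supervised setting, vectors like the one returned by compare would be the inputs to the classifier, with the pre-labeled pairs supplying the match or non-match labels.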