Probabilistic Linkage of Records
Introduction to Record Linkage
Probabilistic linkage allows
you to combine different data sets into one extensive data set
that can be available for analysis. Probabilistic linkage can be
important to injury surveillance systems, crash evaluation
systems, and other public health programs. For example, your
linked data set might permit you to follow a crash victim from
the crash site through the EMS system into the hospital and out
into a rehabilitation program. Outcome variables could be related
to crash characteristics, such as seat belt use or driver alcohol
involvement. NEDARC can provide technical assistance so that you
can accomplish the necessary linkage between data files to answer
these types of questions, even if the data files were not
prospectively intended to be linked.
It is useful to consider how one might manually link two
different sets of records. Consider the hypothetical situation in
which you have 100 ambulance run sheets and 100 inpatient charts.
Assume for simplicity that we know that all 100 EMS records link
to the 100 inpatient records. How would you approach this
problem?
One approach is to take the first ambulance run sheet off the top
of the 100 run sheets, read its contents, and then read through
all 100 inpatient records. As you compare the data field contents
from that record to each of the 100 inpatient records, you come
to a "ranking" of the inpatient records, deciding which
inpatient record is most likely to link to the ambulance run
sheet. After you have made the 100 comparisons, you take the
ambulance run sheet and attach it to the inpatient record. You
then set the two aside. They are "linked".
You then take ambulance run sheet #2 and compare it with the
remaining 99 inpatient records. Unfortunately, you may have
misassigned the first inpatient record: perhaps it would have been
a better match for this second run sheet. So, back to the drawing board.
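The pitfall of this first approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample records, and the simple count of agreeing fields used as a score are all hypothetical, and a real linkage would weight fields rather than count them.

```python
def agreement_score(a, b):
    """Hypothetical score: number of fields on which two records agree."""
    return sum(1 for field in a if field in b and a[field] == b[field])

def greedy_link(run_sheets, charts):
    """Link each run sheet, in order, to the best chart not yet taken."""
    remaining = list(range(len(charts)))
    links = []
    for i, sheet in enumerate(run_sheets):
        best = max(remaining, key=lambda j: agreement_score(sheet, charts[j]))
        links.append((i, best))
        remaining.remove(best)  # once set aside, a chart is never reconsidered
    return links

# Made-up records: the first run sheet is sparse, the second is complete.
run_sheets = [
    {"sex": "F", "age": 34, "date": "1996-02-01"},
    {"sex": "F", "age": 34, "date": "1996-02-01", "zip": "84101"},
]
charts = [
    {"sex": "F", "age": 34, "date": "1996-02-01", "zip": "84101"},
    {"sex": "F", "age": 35, "date": "1996-02-01", "zip": "84770"},
]

links = greedy_link(run_sheets, charts)
print(links)
```

In this made-up example the first run sheet claims chart 0 because it agrees on three fields, even though the second run sheet agrees with chart 0 on all four. The second run sheet is then left with chart 1, which is exactly the misassignment described above.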
The second approach is to compare each ambulance run sheet with
each inpatient record. This exhaustive method is certain to be
the best one from a precision standpoint. You do all the
comparisons and then make a rank ordering of the best pairs of
records. The difficulty with this method is that you must make
100 x 100 comparisons, or 10,000 comparisons. This will be
tedious for even the most patient clerical staff you might be able
to hire!
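A minimal sketch of the exhaustive method follows. It assumes a hypothetical score (a count of agreeing fields) and made-up records that share only an id field; the function names are illustrative and not part of any linkage software. Every run sheet is compared with every chart, the pairs are ranked, and the best pairs are accepted first.

```python
from itertools import product

def agreement_score(a, b):
    """Hypothetical score: number of fields on which two records agree."""
    return sum(1 for field in a if field in b and a[field] == b[field])

def exhaustive_link(run_sheets, charts):
    """Score every possible pair, rank the pairs, and accept them best first."""
    scored = [
        (agreement_score(run_sheets[i], charts[j]), i, j)
        for i, j in product(range(len(run_sheets)), range(len(charts)))
    ]
    comparisons = len(scored)
    # Stable sort: highest score first, ties keep their original order.
    scored.sort(key=lambda pair: -pair[0])
    used_sheets, used_charts, links = set(), set(), []
    for _score, i, j in scored:
        if i not in used_sheets and j not in used_charts:
            links.append((i, j))
            used_sheets.add(i)
            used_charts.add(j)
    return sorted(links), comparisons

# 100 made-up run sheets and 100 made-up charts that agree only on a shared id.
run_sheets = [{"id": k} for k in range(100)]
charts = [{"id": k} for k in range(100)]
links, comparisons = exhaustive_link(run_sheets, charts)
print(comparisons)  # 10000
```

The number of comparisons is the product of the two file sizes, so the work grows much faster than the files themselves.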
There is a much more serious limitation to this exhaustive
approach. Utah has relatively small data files, but these files
form a real life example of the limitation. Utah has
approximately 65,000 computerized ambulance run sheets. The state also
has approximately 110,000 crash victim records per year. Finally,
in real life, only a fraction of the crash victims will actually
link to ambulance run sheets, since most of the victims are not
injured. In Utah, approximately 12% of the crash victims are
subsequently involved with prehospital caregivers. If we use the
exhaustive approach, we will have to do 65,000 x 110,000
comparisons, or 7,150,000,000 pairwise comparisons! No matter
how much a computer enjoys tedious work, this is a large number of
comparisons. If we had a computer that would accomplish 1 million
comparisons per second, it would require about 2 hours to do the
comparisons. Unfortunately, no such computer exists at present.
Using microcomputer platforms, we may be able to
do 10,000 comparisons per second; then our example will require
about 200 hours. Take it to the next level, and try a state like
California. The files may contain millions of records; exhaustive
search will not be technically feasible even on the fastest
supercomputers available today.
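The arithmetic behind the Utah example can be checked with a short script. The record counts and comparison rates are the ones quoted in the text; the variable and function names are illustrative only.

```python
ambulance_records = 65_000
crash_records = 110_000

# Exhaustive linkage compares every ambulance record with every crash record.
comparisons = ambulance_records * crash_records  # 7,150,000,000

def hours_needed(n_comparisons, per_second):
    """Convert a comparison count and a rate into elapsed hours."""
    return n_comparisons / per_second / 3600

fast = hours_needed(comparisons, 1_000_000)  # hypothetical fast machine
slow = hours_needed(comparisons, 10_000)     # microcomputer estimate from the text

print(f"{comparisons:,} comparisons")
print(f"{fast:.1f} hours at 1,000,000 comparisons per second")
print(f"{slow:.0f} hours at 10,000 comparisons per second")
```

The product is 7.15 billion comparisons, which is just under 2 hours at the faster rate and just under 200 hours at the slower one.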
Questions about probabilistic linkage and comments about this page
should be sent to J. Michael Dean, M.D.
Last updated February 26, 1996 J. Michael Dean, M.D.