Probabilistic Linkage of Records


Introduction to Record Linkage

Probabilistic linkage allows you to combine different data sets into one larger data set that can then be analyzed. Probabilistic linkage can be important to injury surveillance systems, crash evaluation systems, and other public health programs. For example, your linked data set might permit you to follow a crash victim from the crash site through the EMS system into the hospital and out into a rehabilitation program. Outcome variables could then be related to crash characteristics, such as seat belt use or driver alcohol involvement. NEDARC can provide technical assistance so that you can accomplish the necessary linkage between data files to answer these types of questions, even if the data files were not originally designed to be linked.

It is useful to consider how one might manually link two different sets of records. Consider the hypothetical situation in which you have 100 ambulance run sheets and 100 inpatient charts. Assume for simplicity that we know that each of the 100 EMS records links to exactly one of the 100 inpatient records. How would you approach this problem?

One approach is to take the first ambulance run sheet off the top of the stack of 100, read its contents, and then read through all 100 inpatient records. As you compare the data fields of that run sheet with each of the 100 inpatient records, you arrive at a "ranking" of the inpatient records, deciding which one is most likely to link to the ambulance run sheet. After you have made the 100 comparisons, you take the ambulance run sheet and attach it to the top-ranked inpatient record. You then set the two aside. They are "linked".

You then take ambulance run sheet #2 and compare it with the remaining 99 inpatient records. Unfortunately, you may have misassigned the first inpatient record, since it might have been a better match for this second run sheet. So, back to the drawing board.

The second approach is to compare each ambulance run sheet with each inpatient record. This exhaustive method should be the most accurate, since no candidate pair is overlooked. You do all the comparisons and then rank the pairs of records from best to worst. The difficulty with this method is that you must make 100 × 100 comparisons, or 10,000 comparisons. This will be tedious for even the most patient clerical staff you might be able to hire!
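The exhaustive approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the toy records, and the `agreement_score` function (a simple count of agreeing fields) are hypothetical stand-ins, not the match weights a real probabilistic linkage would use.

```python
from itertools import product

def agreement_score(ems, inpatient, fields):
    """Count the fields on which two records agree.

    Assumption: a plain agreement count stands in for a real match weight.
    """
    return sum(
        1
        for f in fields
        if ems.get(f) is not None and ems.get(f) == inpatient.get(f)
    )

def rank_all_pairs(ems_records, inpatient_records, fields):
    """Score every EMS/inpatient pair and return (score, ems_idx, inpatient_idx), best first."""
    scored = [
        (agreement_score(e, h, fields), i, j)
        for (i, e), (j, h) in product(
            enumerate(ems_records), enumerate(inpatient_records)
        )
    ]
    # Highest score first; ties keep their original comparison order.
    scored.sort(key=lambda t: -t[0])
    return scored

# Hypothetical toy data: two run sheets and two inpatient charts.
FIELDS = ["birth_date", "sex", "event_date"]
ems_records = [
    {"birth_date": "1970-03-02", "sex": "F", "event_date": "1995-06-01"},
    {"birth_date": "1988-11-20", "sex": "M", "event_date": "1995-06-01"},
]
inpatient_records = [
    {"birth_date": "1988-11-20", "sex": "M", "event_date": "1995-06-01"},
    {"birth_date": "1970-03-02", "sex": "F", "event_date": "1995-06-02"},
]

ranked = rank_all_pairs(ems_records, inpatient_records, FIELDS)
for score, i, j in ranked:
    print(f"EMS #{i} vs inpatient #{j}: {score} field(s) agree")
```

The list built by `rank_all_pairs` always has one entry per pair, so its length is the product of the two file sizes. That is exactly why the method is manageable for 100 × 100 records and not for much larger files.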

There is a much more serious limitation to this exhaustive approach. Utah has relatively small data files, but these files form a real-life example of the limitation. Utah has approximately 65,000 computerized ambulance run sheets per year. It also has approximately 110,000 crash victim records per year. Finally, in real life, only a fraction of the crash victims will actually link to ambulance run sheets, since most of the victims are not injured. In Utah, approximately 12% of the crash victims are subsequently involved with prehospital caregivers. If we use the exhaustive approach, we will have to do 65,000 × 110,000 comparisons, or 7,150,000,000 pairwise comparisons! No matter how much a computer enjoys tedious work, this is a large number of comparisons. If we had a computer that could accomplish 1 million comparisons per second, it would require about 2 hours to do the comparisons. Unfortunately, no such computer exists at present. Using microcomputer platforms, we may be able to do 10,000 comparisons per second; our example will then require about 200 hours. Take it to the next level, and try a state like California. The files may contain millions of records; exhaustive search will not be technically feasible even on the fastest supercomputers available today.
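The arithmetic behind these estimates is easy to check. The short sketch below multiplies the two approximate file sizes quoted above and converts the result to hours at the two comparison rates mentioned; the variable and function names are illustrative only.

```python
# Approximate annual file sizes quoted in the text.
ems_records = 65_000
crash_records = 110_000

# Exhaustive search compares every EMS record with every crash record.
comparisons = ems_records * crash_records

def hours_needed(n_comparisons, per_second):
    """Convert a comparison count to hours at a given comparisons-per-second rate."""
    return n_comparisons / per_second / 3600

fast = hours_needed(comparisons, 1_000_000)  # hypothetical 1 million comparisons/second
slow = hours_needed(comparisons, 10_000)     # microcomputer rate assumed in the text

print(f"{comparisons:,} pairwise comparisons")
print(f"{fast:.1f} hours at 1,000,000 comparisons per second")
print(f"{slow:.1f} hours at 10,000 comparisons per second")
```

The product is 7,150,000,000 comparisons, which works out to just under 2 hours at the faster rate and just under 200 hours at the slower one.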


Questions about probabilistic linkage and comments about this page should be sent to J. Michael Dean, M.D.


Last updated February 26, 1996 J. Michael Dean, M.D.


Utah CODES (Crash Outcome Data Evaluation System)

 615 Arapeen Dr, Suite 202 Salt Lake City, UT 84108-1226 
Ph: (801) 581-6410, Fax: (801) 581-8686
General Information: larry.cook@hsc.utah.edu