Comparison of Pre- and Post-Resolution Blocking Strategies for Entity Resolution in a Distributed Computing Environment

Pei Wang, University of Arkansas for Medical Sciences
Daniel Pullen, University of Arkansas at Little Rock
John R. Talburt, University of Arkansas at Little Rock

Description

Data integration is critical for Customer Relationship Management, and organizations use Entity Resolution (ER) to integrate their customer data. ER is an O(n²) problem, where n is the number of records to be processed, because in the worst case every record must be compared with every other record. Blocking reduces the number of pairwise comparisons by separating the data into groups so that the records most likely to match fall within the same block. The ER system compares only pairs of records within the same block, which reduces the total number of pairs to match. Traditionally, blocking algorithms built inverted indices in memory to locate potential matches quickly. With Big Data, processing has moved to distributed environments that use multiple processors to exploit parallelism. By design, distributed processing environments do not have a single, shared memory space. This research describes the design, verification, and validation of two new blocking strategies for Boolean rule-based ER processes running in the Hadoop distributed environment.
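
The effect of blocking on the number of comparisons can be illustrated with a minimal single-machine sketch. It is not one of the two strategies described in this research: the blocking key (first three letters of the last name plus ZIP code), the field names, and the sample records are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the last name
    # plus the ZIP code. Real ER systems derive keys from their match rules.
    return (record["last_name"][:3].lower(), record["zip"])


def build_blocks(records):
    # Group records by blocking key (an in-memory inverted index).
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    # Only records that share a block are ever compared.
    for members in blocks.values():
        yield from combinations(members, 2)


# Made-up sample data, for illustration only.
records = [
    {"id": 1, "last_name": "Smith", "zip": "72204"},
    {"id": 2, "last_name": "Smithe", "zip": "72204"},
    {"id": 3, "last_name": "Jones", "zip": "72205"},
    {"id": 4, "last_name": "Jones", "zip": "72205"},
    {"id": 5, "last_name": "Brown", "zip": "72204"},
]

blocks = build_blocks(records)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
all_pairs = len(records) * (len(records) - 1) // 2  # n(n-1)/2 without blocking

print(f"pairs without blocking: {all_pairs}")
print(f"pairs with blocking:    {len(pairs)} -> {pairs}")
```

With these five records, exhaustive comparison needs 10 pairs, while blocking leaves 2 candidate pairs. The sketch keeps the index in one process's memory, which is exactly what a distributed environment without shared memory cannot rely on.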

 
Aug 10th, 12:00 AM
