Comparison of Pre- and Post-Resolution Blocking Strategies for Entity Resolution in a Distributed Computing Environment
Description
Data integration is critical for Customer Relationship Management, and organizations use Entity Resolution (ER) to integrate their customer data. ER is an O(n^2) problem, where n is the number of records to be processed, because in the worst case every record must be compared with every other record. Blocking reduces the number of pairwise comparisons by separating the data into groups so that the records most likely to match fall within the same block. The ER system compares only pairs of records within the same block, which reduces the total number of pairs to match. Traditionally, blocking algorithms built inverted indices in memory to locate potential matches quickly. With Big Data, processing has moved to distributed environments in which multiple processors work in parallel. By design, such environments do not have a single, shared memory space, so one in-memory inverted index cannot be shared by all processors. This research describes the design, verification, and validation of two new blocking strategies for Boolean rule-based ER processes running in the Hadoop distributed environment.
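The effect of blocking on the number of comparisons can be illustrated with a minimal single-machine sketch. It is not either of the two strategies developed in this research. The blocking key (the first three letters of a surname), the record layout, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the surname, lowercased.
    return record["surname"][:3].lower()


def candidate_pairs(records):
    # Group record ids by blocking key, then pair ids only within a block.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Illustrative records, not data from the study.
records = [
    {"id": 1, "surname": "Johnson"},
    {"id": 2, "surname": "Johnston"},
    {"id": 3, "surname": "Smith"},
    {"id": 4, "surname": "Smyth"},
    {"id": 5, "surname": "Johansson"},
]

all_pairs = len(records) * (len(records) - 1) // 2  # n(n-1)/2 without blocking
blocked = candidate_pairs(records)

print(all_pairs)     # 10
print(len(blocked))  # 3
print(blocked)       # [(1, 2), (1, 5), (2, 5)]
```

In this toy data set, blocking cuts the candidate pairs from 10 to 3, but "Smith" and "Smyth" land in different blocks and are never compared. That trade-off between fewer comparisons and missed matches is why the choice of blocking strategy matters.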