Management Information Systems Quarterly

Abstract

The importance of data quality assessment in general, and duplicate detection in particular, has been recognized in both research and practice. Duplicates are known to cause critical problems in many domains, including customer relationship management, data management and data warehousing, fraud detection, production, and healthcare. Such duplicates are caused by events that are typically associated with data patterns in the records. For example, duplicates in customer databases caused by the event “relocation of a customer” characteristically exhibit dissimilar values for address-related features, while the values of the other features (e.g., name-related features) tend to be highly similar. Analyzing duplicate-related events and recognizing such patterns seems particularly promising for duplicate detection. However, existing approaches take no advantage of this: they neither consider events nor recognize their associated data patterns when detecting potential duplicates. In this paper, we introduce events as causes of duplicates and, on this basis, propose a novel probability-based approach for duplicate detection. Our approach assigns to each analyzed pair of records the probability of being a duplicate, while avoiding the limiting methodological assumptions of existing approaches. An evaluation on seven different datasets shows that our approach outperforms existing approaches and that the learned models can be transferred between datasets of the same domain without further labeling or repeated model learning.
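The “relocation of a customer” pattern described in the abstract can be made concrete with a small sketch. This is not the paper’s method, only an illustration of the underlying idea: a duplicate pair caused by a relocation event shows high similarity on name-related features but low similarity on address-related features. The records, feature names, thresholds, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's approach): per-feature similarity
# for a candidate record pair, and a simple check for the characteristic
# pattern of the event "relocation of a customer".
from difflib import SequenceMatcher


def feature_similarities(a: dict, b: dict) -> dict:
    """String similarity in [0, 1] for each shared feature of two records."""
    return {f: SequenceMatcher(None, a[f], b[f]).ratio() for f in a}


# Invented example records for the same customer after a relocation:
# name-related features stay highly similar, address features diverge.
rec1 = {"name": "Maria Keller", "street": "12 Oak Avenue", "city": "Boston"}
rec2 = {"name": "Maria Keller", "street": "7 Pine Road", "city": "Denver"}

sims = feature_similarities(rec1, rec2)
name_sim = sims["name"]
addr_sim = (sims["street"] + sims["city"]) / 2

# "Name similar, address dissimilar" hints at a relocation event,
# i.e., a likely duplicate rather than two distinct customers.
# The thresholds 0.9 and 0.5 are arbitrary illustration values.
relocation_pattern = name_sim > 0.9 and addr_sim < 0.5
```

A classical distance-based matcher that aggregates all feature similarities into one score would tend to miss such a pair, because the dissimilar address features drag the overall similarity down; recognizing the event-specific pattern is what makes the pair detectable.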
