Abstract

The data mining research community is increasingly addressing data quality issues, including problems of dirty data. Hand, Blunt, Kelly and Adams (2000) identified high-level and low-level quality issues in data mining. Kim, Choi, Hong, Kim and Lee (2003) compiled a useful, complete taxonomy of dirty data that provides a starting point for research into effective techniques and fast algorithms for preprocessing data, and into ways of approaching the problems of dirty data. In this study we create a classification scheme for data errors by transforming their general taxonomy to apply to very large multiple-source secondary datasets, which organizations are increasingly compiling for use in their data mining applications. We contribute this classification scheme to the body of research addressing quality issues in the very large multiple-source secondary datasets that today’s global organizations are building through massive data collection from the Internet.
