Start Date

11-12-2016 12:00 AM

Description

Poor data quality is one of the primary issues facing big data projects. Cleaning data and improving quality can be expensive and time-intensive. In data warehouse projects, data cleaning is estimated to account for 30% to 80% of the project's development time and budget. Data quality mining is one method used to identify errors that has become increasingly popular in the past 20 years. Our research-in-progress aims to identify multi-field errors via the mining of functional dependencies. Existing research on data quality mining and functional dependencies has focused on improving algorithms to identify a higher percentage of complex errors. The proposed process strives to introduce an efficient method for expediting error identification and increasing a user's domain knowledge in order to reduce the costs associated with cleaning; the process will also include an assessment of when further cleaning is unlikely to be cost effective.

Share

COinS
 
Dec 11th, 12:00 AM

Mining Domain Knowledge: Using Functional Dependencies to Profile Data

Poor data quality is one of the primary issues facing big data projects. Cleaning data and improving quality can be expensive and time-intensive. In data warehouse projects, data cleaning is estimated to account for 30% to 80% of the project's development time and budget. Data quality mining is one method used to identify errors that has become increasingly popular in the past 20 years. Our research-in-progress aims to identify multi-field errors via the mining of functional dependencies. Existing research on data quality mining and functional dependencies has focused on improving algorithms to identify a higher percentage of complex errors. The proposed process strives to introduce an efficient method for expediting error identification and increasing a user's domain knowledge in order to reduce the costs associated with cleaning; the process will also include an assessment of when further cleaning is unlikely to be cost effective.