Abstract

Because of the heterogeneous nature of multiple data sources, data integration is often one of the most challenging tasks of today’s information systems. While the existing literature has focused on problems such as schema integration and entity identification, our current study attempts to answer a basic question: When an attribute value for a real-world entity is recorded differently in two databases, how should the “best” value be chosen from the set of possible values? We first show how probabilities for attribute values can be derived, and then propose a framework for deciding the cost-minimizing value based on the total cost of type I, type II, and misrepresentation errors.

Share

COinS