Abstract

Data integration is an important topic in the information age. Although its structural aspects have been widely investigated, there is a lack of research on semantic discrepancies between data sources. Data integration should be able to handle input errors such as erroneous data and misspellings. Problems such as domain and data type mismatches, missing values, and duplicated records also need investigation. Object identification is essential for the task of integration, especially if keys are absent or incorrect. The approach presented here uses properties that can be derived from the data sources and used for identification: the derivable attributes. Given two sources, the values of the derivable attributes of pairs of records are compared and classified. A random sample of pairs is used to detect similarities, rules, or classification criteria. Different statistical or data mining techniques can then be applied to classify pairs of records from the two sources, deciding whether or not to link them.
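
As an illustration of the pairwise comparison described above, the following is a minimal sketch in Python and not the method's actual implementation. It assumes records are dictionaries with hypothetical fields `id`, `name`, and `birth_year`; the derived attributes, the neutral score for missing values, and the mean-similarity threshold rule are all assumptions chosen for brevity. In practice the classification rule would be obtained from a sample of pairs using a statistical or data mining technique.

```python
from difflib import SequenceMatcher
from itertools import product


def derive(record):
    """Derive comparison attributes from a raw record (hypothetical choices)."""
    # Lowercase the name and collapse whitespace to soften input errors.
    name = " ".join(record["name"].lower().split())
    return {"name": name, "birth_year": record.get("birth_year")}


def compare(a, b):
    """Return a comparison vector (name similarity, year agreement) for a pair."""
    da, db = derive(a), derive(b)
    name_sim = SequenceMatcher(None, da["name"], db["name"]).ratio()
    # Assumption: a missing value yields a neutral score, not a mismatch.
    if da["birth_year"] is None or db["birth_year"] is None:
        year_sim = 0.5
    else:
        year_sim = 1.0 if da["birth_year"] == db["birth_year"] else 0.0
    return (name_sim, year_sim)


def classify(vector, threshold=0.8):
    """Toy rule: link a pair when its mean similarity reaches the threshold."""
    return sum(vector) / len(vector) >= threshold


def link(source_a, source_b, threshold=0.8):
    """Compare every pair of records across two sources and return linked ids."""
    return [
        (a["id"], b["id"])
        for a, b in product(source_a, source_b)
        if classify(compare(a, b), threshold)
    ]


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    source_a = [
        {"id": "a1", "name": "Jon Smith", "birth_year": 1970},
        {"id": "a2", "name": "Mary Jones", "birth_year": 1985},
    ]
    source_b = [
        {"id": "b1", "name": "John  Smith", "birth_year": 1970},
        {"id": "b2", "name": "Peter Miller", "birth_year": None},
    ]
    print(link(source_a, source_b))
```

The sketch compares all pairs exhaustively, which is quadratic in the number of records; classifying only a random sample of pairs first, as the abstract describes, is one way to find suitable rules before linking the full sources.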
