Abstract

Clustering-based data masking approaches are widely used for privacy-preserving data sharing and data mining. Existing approaches, however, cannot cope with the situation where confidential attributes are categorical. For numeric data, these approaches are also unable to preserve important statistical properties such as variance and covariance of the data. We propose a new approach that handles these problems effectively. The proposed approach adopts a minimum spanning tree technique for clustering data and a micro-perturbation method for masking data. Our approach is novel in that it (i) incorporates an entropy-based measure, which represents the disclosure risk of the categorical confidential attribute, into the traditional distance measure used for clustering in an innovative way; and (ii) introduces the notion of cluster-level microperturbation (as opposed to conventional micro-aggregation) for masking data, to preserve the statistical properties of the data. We provide both analytical and empirical justification for the proposed methodology.

Share

COinS