Abstract

Regression techniques can be used not only for legitimate data analysis, but also to infer private information about individuals. In this paper, we demonstrate that regression trees, a popular data-mining technique, can be used to effectively reveal individuals' sensitive data. This problem, which we call a "regression attack," has been overlooked in the literature, and existing privacy-preserving techniques are not well suited to coping with it. We propose a new approach to counter regression attacks. To protect against privacy disclosure, our approach adopts a novel measure that considers the tradeoff between disclosure risk and data utility during regression-tree pruning. We also propose a dynamic value-concatenation method, which removes the need for the user-defined generalization hierarchy that traditional k-anonymity approaches require. Our approach can be used to anonymize both numeric and categorical data. An experimental study demonstrates the effectiveness of the proposed approach.


Protecting Privacy Against Regression Attacks in Predictive Data Mining
