Start Date

10-12-2017 12:00 AM

Description

Many organizations are starting to make datasets, such as customer review data and service usage logs, openly available. To protect the privacy of the individuals involved, these datasets are usually pseudonymized or anonymized before release. A method called k-anonymization is widely used for such open datasets. Recent literature has shown, however, that this method can be unsafe and can compromise individuals’ privacy. In this paper, we address this problem by analyzing the New York Citi Bike dataset. Our analyses show that, given some generalized and payload data, it is possible to recover other payload data of an individual in the k-anonymized dataset. We also demonstrate that records can be re-identified with a high success rate. These findings shed additional light on the weaknesses of the k-anonymization method and provide evidence of a trade-off between data availability and privacy protection. Finally, we discuss implications for both academics and practitioners.
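To make the kind of attack described above concrete, the following is a minimal sketch and not the authors' method. It assumes a toy table in which every field name and record is hypothetical (none of it is taken from the Citi Bike data). It first checks that each combination of generalized attributes occurs at least k times, then shows that an observer who also knows one payload value can narrow the candidates to a single record and read off the remaining payload.

```python
from collections import Counter

# Hypothetical toy records: generalized attributes plus trip payload.
# Field names and values are illustrative assumptions only.
RECORDS = [
    {"birth_decade": "1980s", "gender": "F", "start_station": "A", "end_station": "B"},
    {"birth_decade": "1980s", "gender": "F", "start_station": "A", "end_station": "C"},
    {"birth_decade": "1980s", "gender": "F", "start_station": "D", "end_station": "B"},
    {"birth_decade": "1990s", "gender": "M", "start_station": "A", "end_station": "B"},
    {"birth_decade": "1990s", "gender": "M", "start_station": "E", "end_station": "F"},
]

# Attributes assumed to have been generalized by the publisher.
QUASI_IDS = ("birth_decade", "gender")


def smallest_group(records, quasi_ids):
    """Size of the smallest group sharing the same generalized attribute values."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(counts.values())


def is_k_anonymous(records, quasi_ids, k):
    """True if every generalized attribute combination occurs at least k times."""
    return smallest_group(records, quasi_ids) >= k


def matching(records, known):
    """Records consistent with everything the observer already knows."""
    return [r for r in records if all(r[key] == value for key, value in known.items())]


# The toy table satisfies 2-anonymity on the generalized attributes.
k_ok = is_k_anonymous(RECORDS, QUASI_IDS, 2)

# Generalized attributes plus one known payload value (the start station)
# leave a single candidate, which exposes the remaining payload.
candidates = matching(
    RECORDS,
    {"birth_decade": "1980s", "gender": "F", "start_station": "D"},
)
```

In this toy case the table passes the 2-anonymity check, yet the combination of generalized attributes and a single payload value still isolates one record. This illustrates why k-anonymity over the generalized attributes alone does not bound what can be inferred from the payload.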

Dec 10th, 12:00 AM

Angels and Daemons: Is more Knowledge better than less Privacy? An Empirical Study on a K-anonymized openly available Dataset
