Communications of the Association for Information Systems


The growing number of mobile devices that run the Android operating system has attracted cybercriminals who seek to disrupt these devices or gain unauthorized access to them through malware infections. To counter such malware, cybersecurity experts and researchers require datasets of malware samples that most available antivirus programs cannot detect. However, researchers have rarely discussed how to identify evolving Android malware characteristics across different sources. In this paper, we analyze a wide variety of Android malware datasets to identify the more discriminative features, such as permissions and intents. We then apply machine-learning techniques to group the samples collected from the different datasets into clusters based on the similarity of the extracted features. We randomly sample from each cluster to test whether antivirus software can detect the sampled malware. We also discuss some common pitfalls in selecting datasets. Our findings benefit firms by serving as an exhaustive source of information about leading Android malware datasets.