Abstract

In the big data era, the analysis of data sets is becoming increasingly important, and the central concern is how to obtain valuable information from the data records. Most data sets, however, contain outliers. Outliers can cause incorrect information to be extracted from a data set; detecting them helps us correct the extracted rules or derive them more easily. In this paper, we combine distance-based and clustering-based outlier detection and use the theory of the minimum spanning tree and the standard normal distribution to define a new outlier detection method. The algorithm also identifies the data records in a data set that deserve particular attention. It works in two phases. In the first phase, we build a minimum spanning tree over all data records and compute the average and the standard deviation of its edge weights. In the second phase, we use the distance from each data record to its $K$ nearest neighbours to discover the outliers. Experimental results show that the algorithm improves both accuracy and efficiency.
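
The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the exact decision rule, so the sketch assumes Euclidean distance, uses the mean distance to the $K$ nearest neighbours as the per-record score, and flags a record when that score exceeds the mean MST edge weight plus `z` standard deviations. The function names `mst_edge_weights` and `detect_outliers` and the parameters `k` and `z` are hypothetical.

```python
import math
import statistics


def mst_edge_weights(points):
    """Return the edge weights of a Euclidean minimum spanning tree.

    Uses Prim's algorithm on the complete graph, O(n^2) time.
    """
    n = len(points)
    in_tree = [False] * n
    in_tree[0] = True
    # best[v] = cheapest known edge connecting v to the current tree
    best = [math.dist(points[0], p) for p in points]
    weights = []
    for _ in range(n - 1):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        weights.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return weights


def detect_outliers(points, k=3, z=3.0):
    """Return indices of points flagged as outliers.

    Phase 1: mean and standard deviation of the MST edge weights.
    Phase 2: compare each point's mean distance to its k nearest
    neighbours with mean + z * std (an assumed rule, not taken
    from the paper).
    """
    weights = mst_edge_weights(points)
    mean_w = statistics.fmean(weights)
    std_w = statistics.pstdev(weights)
    threshold = mean_w + z * std_w  # assumed threshold

    outliers = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        knn_score = statistics.fmean(dists[:k])
        if knn_score > threshold:
            outliers.append(i)
    return outliers


# Usage: a 5 x 5 unit grid plus one distant record.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
records = grid + [(20.0, 20.0)]
flagged = detect_outliers(records, k=3)
```

In this toy data set only the distant record (index 25) exceeds the threshold. The sketch recomputes all pairwise distances in phase 2, which is quadratic in the number of records; a spatial index would be the natural replacement for larger inputs.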
