Abstract

Unsupervised models are becoming increasingly common in business processes. They are particularly effective when there is no clearly defined decision class or when the data contains anomalies that are hard to identify. Difficulties arise, however, in processing categorical data effectively. Many new approaches have recently been designed to analyze this data type, yet most of them do not address the issue of unbalanced datasets, which is extremely difficult to detect when dealing with unlabeled data. Moreover, it is sometimes challenging to determine whether abnormal observations form a small data cluster or are genuine anomalies. This research analyzes several less popular algorithms that solve this problem and automatically place abnormal observations into separate clusters. We show that such methods are not only much better at clustering unbalanced data but also perfectly detect outliers in categorical datasets.

Recommended Citation

Łazarz, W. & Nowak-Brzezińska, A. (2024). Detecting Outliers in Context of Clustering Imbalanced Categorical Data. In B. Marcinkowski, A. Przybylek, A. Jarzębowicz, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Harnessing Opportunities: Reshaping ISD in the post-COVID-19 and Generative AI Era (ISD2024 Proceedings). Gdańsk, Poland: University of Gdańsk. ISBN: 978-83-972632-0-8. https://doi.org/10.62036/ISD.2024.35

Paper Type

Full Paper

DOI

10.62036/ISD.2024.35

