Description
Although machine learning methods have made financial fraud detection more effective, they share a common problem: dataset imbalance, i.e., non-fraud cases substantially outnumber fraud cases. In this paper, we propose applying generative adversarial networks (GANs) to generate synthetic fraud cases for a dataset of public firms convicted by the United States Securities and Exchange Commission for accounting malpractice. The approach aims to increase the prediction accuracy of downstream logit, support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost) classifiers by training them on a better-balanced dataset. The results indicate that a state-of-the-art machine learning model such as XGBoost can outperform previous fraud detection models on the same data; however, generating synthetic fraud cases before applying a machine learning model does not improve performance.
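The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the function names `fit_gaussian_sampler` and `augment_minority`, and the 95/5 class split are all assumptions made for illustration, and a simple per-feature Gaussian sampler stands in for a trained GAN generator so that the example runs with the Python standard library alone.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Row = List[float]


def fit_gaussian_sampler(rows: Sequence[Row], rng: random.Random) -> Callable[[], Row]:
    """Stand-in for a trained GAN generator (assumption, not the paper's model).

    Samples each feature independently from a Gaussian fitted to the real
    fraud rows. A GAN would learn the joint distribution instead.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return lambda: [rng.gauss(m, s) for m, s in zip(means, stdevs)]


def augment_minority(
    X: Sequence[Row],
    y: Sequence[int],
    sample_fraud: Callable[[], Row],
    fraud_label: int = 1,
) -> Tuple[List[Row], List[int]]:
    """Append synthetic fraud rows until both classes have the same size."""
    n_fraud = sum(1 for label in y if label == fraud_label)
    n_missing = (len(y) - n_fraud) - n_fraud
    X_aug = [list(row) for row in X]
    y_aug = list(y)
    for _ in range(max(0, n_missing)):
        X_aug.append(sample_fraud())
        y_aug.append(fraud_label)
    return X_aug, y_aug


# Hypothetical imbalanced training split: 95 non-fraud rows, 5 fraud rows.
rng = random.Random(0)
X_train = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(95)]
X_train += [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(5)]
y_train = [0] * 95 + [1] * 5

fraud_rows = [row for row, label in zip(X_train, y_train) if label == 1]
sampler = fit_gaussian_sampler(fraud_rows, rng)
X_aug, y_aug = augment_minority(X_train, y_train, sampler)
```

In a setup like this, only the training split would be augmented, with the evaluation data left untouched so that reported performance reflects real cases. The augmented `X_aug` and `y_aug` would then be passed to the downstream classifier.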
Recommended Citation
Fukas, Philipp; Menzel, Lukas; and Thomas, Oliver, "Augmenting Data with Generative Adversarial Networks to Improve Machine Learning-Based Fraud Detection" (2022). Wirtschaftsinformatik 2022 Proceedings. 4.
https://aisel.aisnet.org/wi2022/analytics_talks/analytics_talks/4