Abstract
As artificial intelligence, data science, machine learning, and deep learning continue to advance, their applications span an ever broader range of domains. Understanding financial transaction behavior is crucial for financial institutions that want to predict and prevent fraudulent activity, improve customer satisfaction, and enhance decision-making [1]. This research addresses the need for accurate predictive models in financial transactions and provides a framework that can be applied to similar datasets in the industry. It does so by applying the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology to a financial dataset. The combination of CRISP-DM and logistic regression is versatile and widely applicable: beyond financial topics, it has also been used to predict diabetes [2]. CRISP-DM offers a structured approach to planning data mining and machine learning projects through its six sequential phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Logistic regression was selected as the predictive model for this study because of its effectiveness in binary classification, interpretability, robustness, and simplicity [3]. The dataset analyzed consists of 200,000 rows and 200 numerical variables [4]. Exploratory Data Analysis (EDA) was conducted to understand the data's characteristics. The data preparation phase involved selecting relevant data, cleaning it, and transforming it for modeling. No missing values were found, and the dataset was split into training (70%) and validation (30%) sets. The model was trained on the training set and evaluated on the validation set. The evaluation metrics indicated high accuracy (0.9146) and specificity (0.9867) but low sensitivity (0.2689), suggesting the need for further refinement to improve sensitivity.
Predictive modeling in financial transactions is a well-explored area, with numerous studies employing various machine learning techniques. Prior research has demonstrated the utility of methods such as decision trees, neural networks, and support vector machines in predicting financial behaviors [5]. This study builds on that foundation by integrating the CRISP-DM methodology, ensuring a systematic approach to data mining projects. The primary contribution of this paper is the application of the CRISP-DM methodology to a real-world financial dataset, showcasing its effectiveness in structuring data mining projects. The novelty lies in the detailed implementation of logistic regression within this framework, highlighting the model's strengths and weaknesses and suggesting potential improvements for better predictive performance.
The Business Understanding phase involved understanding the financial domain and the specific challenges Santander Bank faces in predicting customer transaction behaviors. By analyzing transaction data, financial institutions can forecast future transaction patterns and identify fraudulent activities. The Data Understanding phase involved EDA to understand the characteristics of the data. The dataset consists of 200,000 rows and 200 numerical variables, and 10.049% of the target values are 1, indicating an imbalanced binary classification problem. The Data Preparation phase involved selecting relevant data, cleaning it, and transforming it for modeling. No missing values were found, and the dataset was split into training (70%) and validation (30%) sets. In the Modeling phase, logistic regression was chosen for its effectiveness in binary classification problems. The model was trained on the training set and evaluated on the validation set. In the Evaluation phase, the metrics indicated high accuracy (0.9146) and specificity (0.9867) but low sensitivity (0.2689).
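The reported metrics can be cross-checked against one another. The sketch below is illustrative only and is not the authors' code: it assumes the validation set has the same share of positive targets as the full dataset (10.049%), which the abstract does not state. Under that assumption, accuracy is a prevalence-weighted mix of sensitivity and specificity. The function names are hypothetical.

```python
def rates_from_confusion(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate (recall on class 1)
    specificity = tn / (tn + fp)  # true negative rate (recall on class 0)
    return accuracy, sensitivity, specificity


def accuracy_from_rates(prevalence, sensitivity, specificity):
    """Accuracy as a prevalence-weighted mix of sensitivity and specificity."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Figures reported in the abstract.
PREVALENCE = 0.10049   # share of target == 1 (ASSUMED to hold in the validation set)
SENSITIVITY = 0.2689
SPECIFICITY = 0.9867

implied_accuracy = accuracy_from_rates(PREVALENCE, SENSITIVITY, SPECIFICITY)
baseline_accuracy = 1.0 - PREVALENCE  # accuracy of always predicting class 0

print(f"implied accuracy:  {implied_accuracy:.4f}")   # 0.9146
print(f"baseline accuracy: {baseline_accuracy:.4f}")  # 0.8995
```

Under the stated assumption, the implied accuracy matches the reported 0.9146. A classifier that always predicts class 0 would already reach roughly 0.8995 on such data, so on an imbalanced target sensitivity is the more informative figure.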
The Deployment phase involves implementing the model in a production environment to assist decision-making at Santander Bank. Applying the CRISP-DM methodology provided a structured approach to predictive analysis in financial transactions. Logistic regression proved to be a robust model for this dataset, though its sensitivity needs improvement. The current model exhibits high accuracy and specificity, indicating that it reliably identifies non-event outcomes (true negatives). However, the low sensitivity shows that the model struggles to identify event outcomes (true positives). To address this, future work should consider resampling methods (e.g., the Synthetic Minority Oversampling Technique, SMOTE) to balance the dataset, or different algorithms that might better capture the characteristics of the minority class, such as ensemble methods like Random Forest, boosting algorithms like XGBoost, and deep learning models. Fine-tuning the logistic regression model with different regularization techniques could also improve sensitivity. Another direction for future research is to incorporate more features that may be indicative of the target variable, potentially improving the model's ability to distinguish between classes. A deeper analysis of feature importance and correlation with the target variable could yield insights into new predictive features. Furthermore, deploying the model in a real-world setting would require continuous monitoring and retraining with new data to maintain its accuracy and reliability.
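To make the resampling idea concrete, the following is a minimal sketch of SMOTE-style oversampling: each synthetic minority row is an interpolation between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the reference SMOTE implementation and not something the study reports having run. The function name `smote_like_oversample`, its parameters, and the toy data are all hypothetical. In practice a maintained library such as imbalanced-learn would likely be preferable.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea (hypothetical helper).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base row,
        # excluding the base row itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): 4 minority rows against 36 majority rows,
# roughly mirroring a 10% positive rate.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n_majority = 36

synthetic = smote_like_oversample(minority, n_new=n_majority - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority), "minority rows after oversampling")
```

A sketch like this would normally be applied to the training split only, so that the validation set keeps the original class distribution and the measured sensitivity stays comparable.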
References
[1] Shen, S., Jiang, H., & Zhang, T. (2012). Stock market forecasting using machine learning algorithms. Department of Electrical Engineering, Stanford University, Stanford, CA, 1-5.
[2] Maydanchi, M., Ziaei, M., Mohammadi, M., Ziaei, A., Basiri, M., Haji, F., & Gharibi, K. (2024). A Comparative Analysis of the Machine Learning Methods for Predicting Diabetes. Journal of Operations Intelligence, 2(1), 230-251.
[3] Schröer, C., Kruse, F., & Gómez, J. M. (2021). A Systematic Literature Review on Applying CRISP-DM Process Model. Procedia Computer Science, 181, 526-534.
[4] https://en.wikipedia.org/wiki/Santander_Bank
[5] Dehghani, F., & Larijani, A. (2023). An Algorithm for Predicting Stock Market's Index Based on MID Algorithm and Neural Network. Available at SSRN 4448033.
Recommended Citation
Azim Basiri, Mohammad; Parvin, Mohammad; and Sobhani, Zahra, "Implementing CRISP-DM and Logistic Regression for Predictive Analysis in Financial Transactions: A Case Study" (2024). IRAIS 2024 Proceedings. 6.
https://aisel.aisnet.org/irais2024/6
Abstract Only