In the US, breast cancer has the highest cancer death rate apart from lung cancer. As of 2019, on average, 1 in 8 US women (approx. 12%) will develop invasive breast cancer at some point during their lives. These statistics highlight the importance of early detection for reducing patient mortality. In recent years, machine learning (ML) techniques have begun to play a key role in healthcare, especially as a diagnostic aid. In the case of breast cancer, ML techniques can be used to distinguish between malignant and benign tumours, enabling early detection. Moreover, accurate classification can help physicians guide patients and prescribe relevant treatment. Given this background, the objective of this paper is to apply ML algorithms to classify breast cancer outcomes. In this study, we build a platform using Ridge, AdaBoost, Gradient Boost, Random Forest, Principal Component Analysis (PCA) plus Ridge, and Neural Network ML algorithms for early breast cancer outcome detection. As a traditional benchmark technique, we use a logistic regression model to compare against our chosen ML algorithms. We utilise the Wisconsin Breast Cancer Database (WBCD) dataset (Dua and Graff 2019). Although ML is generally deployed with large datasets, we highlight its usefulness and feasibility for small datasets in this study, which uses only 30 features. We contribute to the literature by providing a platform that will enable (a) big data analytics using small datasets and (b) high-accuracy breast cancer outcome classifications. Specifically, we identify the most important features in breast cancer outcome classification from a wide range of ML algorithms with a small dataset. This would enable health practitioners and patients to focus on these key features in their decision making for future breast cancer tests and subsequent early detection, thus reducing analysis and decision latencies.
In our ML-based breast cancer classification platform, the user makes three function calls: the data pre-processor, the model generator and a single test. The pre-processor cleans the user's raw dataset by removing 'NaN' and empty values, and follows any further instructions given in a configuration file. After pre-processing, the model generator trains ML models from two inputs: the cleaned dataset and a configuration file. It creates one model for each ML algorithm specified in the study and generates the corresponding evaluations. The user can then call the single test to make predictions with the generated models.
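The three-call workflow described above could be sketched roughly as follows. This is an illustrative sketch only, not the authors' platform: the function names `preprocess`, `generate_models` and `single_test`, the `CONFIG` dictionary and all hyperparameters are assumptions, since the abstract does not describe the configuration file format or model settings. It also assumes that the Wisconsin diagnostic data bundled with scikit-learn (30 features) corresponds to the dataset used in the study.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the configuration file mentioned in the abstract.
CONFIG = {"test_size": 0.25, "random_state": 0, "pca_components": 10}


def preprocess(X, y):
    """Call 1: drop every row that contains a NaN (missing) value."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]


def generate_models(X, y, config):
    """Call 2: fit one model per algorithm and score it on a held-out split."""
    seed = config["random_state"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config["test_size"], random_state=seed, stratify=y)
    candidates = {
        # Logistic regression is the traditional benchmark.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)),
        "ridge": make_pipeline(StandardScaler(), RidgeClassifier()),
        "adaboost": AdaBoostClassifier(random_state=seed),
        "gradient_boost": GradientBoostingClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(random_state=seed),
        "pca_ridge": make_pipeline(
            StandardScaler(), PCA(n_components=config["pca_components"]),
            RidgeClassifier()),
        "neural_network": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=1000, random_state=seed)),
    }
    models, scores = {}, {}
    for name, model in candidates.items():
        models[name] = model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # held-out accuracy
    return models, scores


def single_test(models, sample):
    """Call 3: predict the class of one sample with every fitted model."""
    sample = np.asarray(sample, dtype=float).reshape(1, -1)
    return {name: int(model.predict(sample)[0]) for name, model in models.items()}


data = load_breast_cancer()  # in this copy, target 0 = malignant, 1 = benign
X, y = preprocess(data.data, data.target)
models, scores = generate_models(X, y, CONFIG)
predictions = single_test(models, X[0])

# One possible way to rank features: random forest impurity-based importances.
importances = models["random_forest"].feature_importances_
top_features = [str(data.feature_names[i])
                for i in np.argsort(importances)[::-1][:5]]
```

Here `scores` maps each algorithm name to its held-out accuracy, and `top_features` lists the five features the random forest weights most heavily. Other models would need their own importance measure (for example, coefficient magnitudes for Ridge), so a ranking drawn from a single model should be treated as indicative rather than definitive.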
Jayasuriya, Dulani; Chan, Johnny; and Sundaram, David, "Big Data Analytics using Small Datasets: Machine Learning for Early Breast Cancer Detection" (2020). AMCIS 2020 TREOs. 57.