Abstract

Studying differentially expressed genes (DEGs) and making predictions from them is crucial for progress in domains such as medical diagnostics, pharmaceuticals, and agriculture, because it provides insight into the underlying biological mechanisms. Machine learning is one of the most important tools for this task, but a persistent challenge is the high dimensionality of DEG data: the number of features is much larger than the number of available samples, which leads to overfitting. To address this issue, we designed a data dimension adjustment method for DEG data, called DDA-DEG, which selects the important features and resamples the data to improve model generalization. DDA-DEG uses a genetic algorithm (GA)-based method for feature selection and a conditional variational autoencoder (CVAE)-based method for resampling. We evaluate DDA-DEG with four machine learning methods: Random Forest (RF), Support Vector Regression (SVR), Single Layer Feed-forward Neural Network (SLFN), and Extreme Learning Machine (ELM). Alongside GA and CVAE, we also tested PCC, PIMP, and Robust Regression for feature selection, and RO, SMOTER, and SMOGN for resampling. Because GA and CVAE performed best, the remaining experiments were carried out with them. With raw data, RF achieves the best RMSE of the four methods (RMSE = 0.1509). Applying DDA-DEG substantially improves SVR's performance, reducing its RMSE from 0.1576 with raw data to 0.095 with feature selection and 0.093 with resampling. This improvement demonstrates the effectiveness of the proposed method in reducing overfitting and improving predictive performance.
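To put the reported figures in perspective, the short sketch below computes the relative RMSE reduction for SVR from the values quoted in the abstract. It is only an illustration of the arithmetic: the helper names `rmse` and `relative_reduction` are hypothetical and are not taken from the authors' code, and the standard RMSE definition is assumed.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error (standard definition, assumed here)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(before, after):
    """Fraction by which an error metric dropped (hypothetical helper)."""
    return (before - after) / before


# RMSE values quoted in the abstract.
RF_RAW = 0.1509
SVR_RAW = 0.1576
SVR_FEATURE_SELECTION = 0.095
SVR_RESAMPLING = 0.093

fs_gain = relative_reduction(SVR_RAW, SVR_FEATURE_SELECTION)
rs_gain = relative_reduction(SVR_RAW, SVR_RESAMPLING)

print(f"SVR, feature selection: {fs_gain:.1%} lower RMSE than raw data")
print(f"SVR, resampling:        {rs_gain:.1%} lower RMSE than raw data")
print(f"SVR after resampling beats raw RF: {SVR_RESAMPLING < RF_RAW}")
```

Under these numbers, both steps cut SVR's RMSE by roughly 40% relative to raw data, and the adjusted SVR also falls below the best raw-data baseline (RF).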
