Abstract

This paper considers classification based on dispersed data. The idea behind dispersed data is to make effective use of data collected independently by different information systems, sources, or units on a single topic when developing a classification model that can classify a new object irrespective of the system, source, or unit it comes from. As in federated learning approaches, the data is protected and not shared between its owners. Local models are built using the bagging method and decision trees with the Twoing criterion. Only the prediction vectors generated by the local models are sent to the central server, where the final aggregation is done by majority voting. The main purpose of the paper is to study the quality of classification obtained with the proposed approach. Another goal is to investigate the impact of pre-pruning the trees on the quality of classification. Moreover, results obtained with the Twoing criterion are compared with those obtained with the Gini index during tree construction. The experiments were performed on seventeen dispersed data sets, two of which reflect dispersion that occurs naturally in practice: medical data collected by different hospitals and medical data collected in different countries. The contribution of this paper is an assessment of how effective the Twoing criterion is as a splitting criterion, combined with the bagging method, when building a classification model for data stored in independent dispersed sources.
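As an illustration only, the two building blocks named in the abstract can be sketched in a few lines of Python. The `twoing` function follows the standard CART definition of the Twoing criterion for a binary split; the `aggregate` function assumes that each local model sends a vector of vote counts per decision class, which is an assumption about the format of the prediction vectors and not a detail stated in the abstract. Both function names are hypothetical.

```python
from collections import Counter


def twoing(left_labels, right_labels):
    """Twoing value of a binary split (standard CART definition):
    (P_L * P_R / 4) * (sum over classes of |p(c|L) - p(c|R)|) ** 2.
    Larger values indicate a better split."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty separates nothing
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    classes = set(left_counts) | set(right_counts)
    diff = sum(
        abs(left_counts[c] / n_left - right_counts[c] / n_right)
        for c in classes
    )
    return (n_left / n) * (n_right / n) / 4.0 * diff ** 2


def aggregate(prediction_vectors):
    """Majority voting on the central server.
    Assumption: each local model sends a dict {class: votes}, e.g. the
    votes cast by its bagged trees. Returns every class with the highest
    total, so ties are reported rather than broken arbitrarily."""
    totals = Counter()
    for vector in prediction_vectors:
        totals.update(vector)  # adds the vote counts class by class
    best = max(totals.values())
    return sorted(c for c, v in totals.items() if v == best)
```

For example, `aggregate([{"a": 3, "b": 2}, {"a": 1, "b": 4}, {"a": 5, "b": 0}])` sums the votes to 9 for `a` and 6 for `b`, so the central decision is `a`. Only the vote vectors cross the boundary between data owners; the raw data stays local.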

Recommended Citation

Przybyła-Kasperek, M. & Aning, S. (2022). Study on the Twoing Criterion with Pre-pruning and Bagging Method for Dispersed Data. In R. A. Buchmann, G. C. Silaghi, D. Bufnea, V. Niculescu, G. Czibula, C. Barry, M. Lang, H. Linger, & C. Schneider (Eds.), Information Systems Development: Artificial Intelligence for Information Systems Development and Operations (ISD2022 Proceedings). Cluj-Napoca, Romania: Babeș-Bolyai University.

Paper Type

Full Paper
