Optimization-based algorithms, such as Multi-Criteria Linear Programming (MCLP), have proven effective in classification. However, because of limits on computing power and memory, it is difficult to apply MCLP, or similar optimization methods, to very large datasets. As databases continue to grow, data mining algorithms must remain usable regardless of dataset size. The objectives of this paper are: (1) to propose a new approach that combines stratified random sampling with a majority-vote ensemble, and (2) to compare this approach with the plain MCLP approach (which uses only part of the training set) and with See5 (a decision-tree-based classification tool designed to analyze substantial datasets) on the KDD99 and KDD2004 datasets. The results indicate that the new approach not only has the potential to handle datasets of arbitrary size, but also outperforms the plain MCLP approach and achieves classification accuracy comparable to See5.
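The general shape of a stratified-sampling, majority-vote ensemble can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the sample sizes, the number of subsets, or the MCLP formulation, so the equal per-class sample size, the function names, and the nearest-centroid base learner (a stand-in for MCLP) are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, labels, n_per_class, rng):
    """Draw up to n_per_class records from each class, without replacement.

    Equal per-class sizes are an assumption; a proportional scheme is
    another common form of stratified sampling.
    """
    by_class = defaultdict(list)
    for x, y in zip(records, labels):
        by_class[y].append(x)
    sample_x, sample_y = [], []
    for y, xs in by_class.items():
        chosen = rng.sample(xs, min(n_per_class, len(xs)))
        sample_x.extend(chosen)
        sample_y.extend([y] * len(chosen))
    return sample_x, sample_y


def train_centroid(xs, ys):
    """Hypothetical stand-in base learner (NOT MCLP): one mean vector per class."""
    counts = Counter(ys)
    sums = {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sqdist(model[y]))


def train_ensemble(records, labels, n_models, n_per_class, seed=0):
    """Train one base classifier per independently drawn stratified sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        xs, ys = stratified_sample(records, labels, n_per_class, rng)
        models.append(train_centroid(xs, ys))
    return models


def majority_vote(models, x):
    """Return the label predicted by the most base classifiers.

    Ties are broken by first-seen order; an odd number of models avoids
    ties in the two-class case.
    """
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

Because each base classifier sees only a small sample, memory use per training run is bounded by the sample size rather than by the full dataset, which is the property that lets such a scheme scale. In practice `train_centroid` and `predict_centroid` would be replaced by the actual optimization-based classifier.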
Peng, Yi; Kou, Gang; Shi, Yong; and Chen, Zhengxin, "Using Optimization-Based Classification Method for Massive Datasets" (2005). AMCIS 2005 Proceedings. 110.