Abstract

A filter feature selection process for a text categorization system consists of two main stages: term score computation and term score ranking. In this work, we focus on the term score ranking stage. We propose two novel term score ranking methods, Term Score Ranking at Category Level (RatCL) and Balanced Term Score Ranking at Total Level (BRatTL), to overcome the imbalance in classification performance among categories that arises after filter feature selection. The main idea of RatCL is to rank terms at the category level, rather than at the total level or the document level, so that the selected set of terms covers all categories better. In contrast, BRatTL improves a common method for term score ranking at the total level by smoothing the score of a term with respect to a category using the discrimination degree and the size of that category. In this way, both RatCL and BRatTL avoid a preference towards categories that are easily distinguishable or that have more documents. Experimental results on three benchmark datasets (Newsgroup, Reuters, and Ohsumed) show the effectiveness of the proposed methods compared with two popular term score ranking methods, Term Score Ranking at Total Level and Term Score Ranking at Document Level.
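The difference between ranking at the total level and ranking at the category level can be illustrated with a minimal sketch. It is not the RatCL algorithm as defined in the paper. It assumes a hypothetical table `scores[term][category]` of precomputed term scores, takes the maximum over categories as the total-level score, and uses a simple round-robin over categories as a stand-in for category-level ranking. The function names and the toy data are illustrative only.

```python
from typing import Dict, List

# Hypothetical input: scores[term][category], e.g. from any filter metric.
Scores = Dict[str, Dict[str, float]]


def rank_total_level(scores: Scores, k: int) -> List[str]:
    """Select k terms by their maximum score over all categories (assumed global score)."""
    usable = [t for t in scores if scores[t]]
    ranked = sorted(usable, key=lambda t: (-max(scores[t].values()), t))
    return ranked[:max(k, 0)]


def rank_category_level(scores: Scores, k: int) -> List[str]:
    """Round-robin stand-in: each category in turn contributes its best unused term."""
    categories = sorted({c for per_cat in scores.values() for c in per_cat})
    if not categories:
        return []
    # One ranked queue of all terms per category (ties broken alphabetically).
    queues = {
        c: sorted(scores, key=lambda t: (-scores[t].get(c, 0.0), t))
        for c in categories
    }
    position = {c: 0 for c in categories}
    selected: List[str] = []
    seen = set()
    limit = min(k, len(scores))
    while len(selected) < limit:
        for c in categories:
            if len(selected) >= limit:
                break
            queue = queues[c]
            # Skip terms that another category has already contributed.
            while position[c] < len(queue) and queue[position[c]] in seen:
                position[c] += 1
            if position[c] < len(queue):
                term = queue[position[c]]
                seen.add(term)
                selected.append(term)
    return selected


# Toy data (made up): "sport" terms score much higher than "medicine" terms.
scores: Scores = {
    "goal": {"sport": 9.0, "medicine": 0.1},
    "match": {"sport": 8.0, "medicine": 0.2},
    "team": {"sport": 7.0, "medicine": 0.1},
    "dose": {"sport": 0.1, "medicine": 3.0},
    "renal": {"sport": 0.0, "medicine": 2.5},
}

print(rank_total_level(scores, 3))     # ['goal', 'match', 'team']
print(rank_category_level(scores, 3))  # ['dose', 'goal', 'renal']
```

On this toy data, total-level ranking fills all three slots with terms from the easily distinguishable category, while the category-level stand-in keeps terms from both categories, which is the kind of imbalance the abstract describes.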
