Abstract

Automated document-category management, particularly document clustering, represents an appealing means of supporting a user's search, access, and utilization of ever-increasing corpora of textual documents. Traditional document clustering techniques generally emphasize the analysis of document contents and measure document similarity on the basis of the overlap between or among the feature vectors representing individual documents. However, clustering documents at the lexical level can be problematic because it cannot effectively address word mismatch or ambiguity. To address the problems inherent in the traditional lexicon-based approach, we propose an Ontology-based Document Clustering (ODC) technique, which employs a domain-specific ontology to support document clustering at the conceptual level. We empirically evaluate the effectiveness of the proposed ODC technique, using lexicon-based and LSI-based document clustering techniques (i.e., HAC and LSI-based HAC) as performance benchmarks. Our comparative analysis shows ODC to be more effective than HAC and LSI-based HAC: it achieves higher cluster precision across all levels of cluster recall, and its advantage in the F1 measure is statistically significant. In addition, our preliminary analysis of the effect of concept-hierarchy granularity suggests that using a fine-grained concept hierarchy can improve the performance of ODC. Our findings have interesting implications for research and practice, which we discuss together with our future research directions.
