PACIS 2018 Proceedings

Text Classification in an Under-Resourced Language via Lexical Normalization and Feature Pooling

Omayya Sohail, Information Technology University of the Punjab
Inam Elahi, Information Technology University of the Punjab
Ahsan Ijaz, ADDO AI
Asim Karim, Lahore University of Management Sciences
Faisal Kamiran, Information Technology University of the Punjab

Abstract

Automatic classification of textual content in an under-resourced language is challenging because lexical resources and preprocessing tools are not available for such languages. The bag-of-words (BoW) representation of text in these languages is usually highly sparse and noisy, and classifiers built on such a representation perform poorly. In this paper, we explore the effectiveness of lexical normalization of terms and statistical feature pooling for improving text classification in an under-resourced language. We focus on classifying citizen feedback on government services, provided through SMS texts written predominantly in Roman Urdu (an informal, forward-transliterated version of the Urdu language). Our proposed methodology normalizes lexical variations of terms using phonetic and string similarity. It then employs a supervised feature extraction technique to obtain category-specific, highly discriminating features. Our experiments with classifiers reveal that lexical normalization plus feature pooling significantly improves classification performance over standard representations.
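
To make the idea of normalizing lexical variations concrete, the following is a minimal sketch and not the authors' implementation. It assumes a crude phonetic key (first letter kept, later vowels dropped, repeated letters collapsed) combined with a string-similarity check from Python's standard `difflib`. The function names `phonetic_key` and `normalize_vocabulary`, the similarity threshold of 0.6, and the sample words are all hypothetical choices made for illustration; the paper's actual phonetic encoding and similarity measure are not specified in this abstract.

```python
from difflib import SequenceMatcher


def phonetic_key(word):
    """Crude, illustrative phonetic key (an assumption, not the paper's encoding).

    Lowercases the word, keeps its first letter, drops later vowels,
    and collapses consecutive repeated letters.
    """
    word = word.lower()
    if not word:
        return ""
    key = word[0]
    for ch in word[1:]:
        if ch in "aeiou":
            continue  # ignore vowels after the first letter
        if ch != key[-1]:
            key += ch  # skip immediate repeats such as "hh" or "cc"
    return key


def normalize_vocabulary(words, threshold=0.6):
    """Map each word to a canonical spelling.

    Two spellings are merged when they share a phonetic key AND their
    string similarity ratio is at least `threshold` (hypothetical value).
    The first spelling seen in a group becomes its canonical form.
    """
    canon_by_key = {}  # phonetic key -> list of canonical spellings
    mapping = {}       # original word -> canonical spelling
    for w in words:
        lowered = w.lower()
        key = phonetic_key(lowered)
        chosen = None
        for candidate in canon_by_key.get(key, []):
            if SequenceMatcher(None, lowered, candidate).ratio() >= threshold:
                chosen = candidate
                break
        if chosen is None:
            chosen = lowered
            canon_by_key.setdefault(key, []).append(chosen)
        mapping[w] = chosen
    return mapping


if __name__ == "__main__":
    # Hypothetical spelling variants of two Roman Urdu words.
    variants = ["acha", "achha", "accha", "bura", "boora"]
    for original, canonical in normalize_vocabulary(variants).items():
        print(original, "->", canonical)
```

Collapsing spelling variants in this way reduces the number of distinct terms before a bag-of-words representation is built, which is one plausible route to the reduced sparsity the abstract describes.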

Recommended Citation

Sohail, Omayya; Elahi, Inam; Ijaz, Ahsan; Karim, Asim; and Kamiran, Faisal, "Text Classification in an Under-Resourced Language via Lexical Normalization and Feature Pooling" (2018). PACIS 2018 Proceedings. 96.
https://aisel.aisnet.org/pacis2018/96
