Abstract

Short texts are brief and feature-sparse, which makes existing classification methods less effective on them. Motivated by this, this paper extracts features at both the “topic” and “word” levels by proposing a convolutional neural network (CNN) based on topics and words, named TW-CNN. The model uses Latent Dirichlet Allocation (LDA), a topic model, and word2vec to obtain two distinct word vector matrices, which serve as the inputs to two separate CNNs. After convolution and pooling, the two CNNs yield two different vector representations of the text. These representations are concatenated with the text-topic vector obtained by LDA to form the final representation vector of the text, which is then classified with softmax. Experiments on short news texts show that the TW-CNN model improves on traditional CNNs.
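
The data flow described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify filter sizes, the activation function, the pooling type, or how the topic-based word vector matrix is built, so the ReLU activation, max-over-time pooling, filter height of 2, random inputs, and all function names here are assumptions made for illustration only.

```python
import math
import random


def conv_max_pool(matrix, filters):
    """Slide each filter over consecutive rows, apply ReLU, keep the max score.

    Assumes every filter has no more rows than the matrix.
    """
    pooled = []
    for filt in filters:
        height = len(filt)
        scores = []
        for start in range(len(matrix) - height + 1):
            window = matrix[start:start + height]
            total = sum(w * x
                        for f_row, m_row in zip(filt, window)
                        for w, x in zip(f_row, m_row))
            scores.append(max(0.0, total))  # ReLU (assumed activation)
        pooled.append(max(scores))  # max-over-time pooling (assumed)
    return pooled


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tw_cnn_forward(word_matrix, topic_matrix, doc_topics,
                   word_filters, topic_filters, weights, bias):
    """Two CNN branches, concatenation with the text-topic vector, softmax."""
    features = (conv_max_pool(word_matrix, word_filters)
                + conv_max_pool(topic_matrix, topic_filters)
                + list(doc_topics))
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return features, softmax(logits)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_words, word_dim, n_topics, n_classes, n_filters = 6, 4, 3, 2, 3

# Random placeholders standing in for real word2vec and LDA outputs.
word_matrix = random_matrix(n_words, word_dim, rng)    # word-level matrix
topic_matrix = random_matrix(n_words, n_topics, rng)   # topic-level matrix
doc_topics = [0.2, 0.5, 0.3]                           # text-topic vector

word_filters = [random_matrix(2, word_dim, rng) for _ in range(n_filters)]
topic_filters = [random_matrix(2, n_topics, rng) for _ in range(n_filters)]
weights = random_matrix(n_classes, 2 * n_filters + n_topics, rng)
bias = [0.0] * n_classes

features, probs = tw_cnn_forward(word_matrix, topic_matrix, doc_topics,
                                 word_filters, topic_filters, weights, bias)
print(len(features), probs)
```

In this sketch the final representation has one entry per filter in each branch plus one entry per topic, so the text-topic vector enters the classifier directly rather than through a convolution.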
