Start Date
August 11, 2016
Description
Corpus Periodization is the process of segmenting a corpus into a set of smaller, discursively coherent periods while retaining its chronological order. Social researchers in fields such as sociology and history often use it to examine topic-specific, temporally ordered corpora. Currently, there are no robust, automated, and easy-to-implement methods for periodizing text corpora. In this paper, we propose a new framework that automates Corpus Periodization. The framework relies on two components: a simple statistical significance test that assesses changes in the number of documents between neighboring segments, and a document similarity measure that evaluates the similarity of texts between neighboring segments. We tested the proposed solution on a corpus of 4,821 news articles containing the term “corporate governance.” The framework reduced the original twenty-eight annual segments to seven or fewer relevant periods.
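The two components named in the description can be illustrated with a minimal sketch. The abstract does not say which significance test or similarity measure the authors use, so everything below is an assumption made for illustration: a chi-square goodness-of-fit test (1 degree of freedom) on the document counts of two neighboring segments, cosine similarity over raw term frequencies, and the hypothetical names `count_change_p_value`, `cosine_similarity`, `periodize`, `alpha`, and `min_similarity`. It is not the paper's implementation.

```python
import math
from collections import Counter


def count_change_p_value(n_a, n_b):
    """P-value for H0: a document is equally likely to fall in either segment.

    Assumed test: chi-square goodness of fit with 1 degree of freedom.
    """
    total = n_a + n_b
    if total == 0:
        return 1.0
    expected = total / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def term_frequencies(docs):
    """Bag-of-words term counts over all documents in a segment."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def periodize(segments, alpha=0.05, min_similarity=0.5):
    """Merge chronologically ordered (label, docs) segments into periods.

    A segment joins the current period only if, compared with the segment
    immediately before it, the document count has not changed significantly
    AND the texts are sufficiently similar. Thresholds are illustrative.
    """
    periods = []
    prev_docs = None
    for label, docs in segments:
        if prev_docs is not None:
            stable = count_change_p_value(len(prev_docs), len(docs)) >= alpha
            similar = cosine_similarity(
                term_frequencies(prev_docs), term_frequencies(docs)
            ) >= min_similarity
            if stable and similar:
                periods[-1].append(label)
                prev_docs = docs
                continue
        periods.append([label])
        prev_docs = docs
    return periods
```

Called with a list of `(year, documents)` pairs in chronological order, `periodize` returns a list of periods, each a list of consecutive year labels. Comparing each segment only with its immediate predecessor keeps the chronological order intact, which matches the description's emphasis on neighboring segments.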
Recommended Citation
Alsudais, Abdulkareem and Tchalian, Hovig, "Corpus Periodization Framework to Periodize a Temporally Ordered Text Corpus" (2016). AMCIS 2016 Proceedings. 15.
https://aisel.aisnet.org/amcis2016/Decision/Presentations/15