Start Date

11-8-2016

Description

Corpus Periodization is the process of segmenting a corpus into a set of smaller and discursively coherent periods while retaining its chronological order. Corpus Periodization is often used by social researchers in fields such as sociology and history to examine texts of topic-specific and temporally ordered corpora. Currently, there are no robust, automated, and easy-to-implement methods to periodize text corpora. In this paper, we propose a new framework that automates Corpus Periodization. This method relies on a simple statistical significance test that assesses the changes in the number of documents between neighboring segments and a document similarity measure that evaluates the similarity of texts between neighboring segments. We tested the proposed solution on a corpus consisting of 4,821 news articles containing the term “corporate governance.” We were able to reduce the original number of annual segments from twenty-eight to seven or fewer relevant periods.
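The two checks described above can be illustrated with a minimal sketch. The abstract names neither the significance test nor the similarity measure, so everything specific here is an assumption made for illustration: a normal-approximation test on the difference in document counts, cosine similarity over raw term counts, the thresholds `alpha` and `min_similarity`, and the rule that neighboring segments are merged only when both checks pass. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def count_change_p_value(n_a, n_b):
    """Two-sided p-value for a change in document counts (assumed test).

    Under the null hypothesis each document is equally likely to fall in
    either segment, so (n_a - n_b) / sqrt(n_a + n_b) is roughly standard normal.
    """
    total = n_a + n_b
    if total == 0:
        return 1.0
    z = (n_a - n_b) / math.sqrt(total)
    return math.erfc(abs(z) / math.sqrt(2))


def cosine_similarity(docs_a, docs_b):
    """Cosine similarity between the pooled term counts of two segments."""
    a = Counter(word for doc in docs_a for word in doc.lower().split())
    b = Counter(word for doc in docs_b for word in doc.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def periodize(segments, alpha=0.05, min_similarity=0.5):
    """Group chronologically ordered (label, documents) segments into periods.

    A segment joins the current period only if, compared with the segment
    right before it, the count change is not significant and the texts are
    similar enough (assumed merge rule). Order is never changed.
    """
    periods = []
    prev_docs = None
    for label, docs in segments:
        same_period = (
            prev_docs is not None
            and count_change_p_value(len(prev_docs), len(docs)) >= alpha
            and cosine_similarity(prev_docs, docs) >= min_similarity
        )
        if same_period:
            periods[-1].append(label)
        else:
            periods.append([label])
        prev_docs = docs
    return periods


if __name__ == "__main__":
    toy_segments = [
        (1990, ["board audit reform", "board audit rules"]),
        (1991, ["board audit reform", "audit board oversight"]),
        (1992, ["shareholder activism proxy", "proxy vote shareholder"]),
    ]
    print(periodize(toy_segments))  # [[1990, 1991], [1992]]
```

In this toy input the first two years share most of their vocabulary and have equal document counts, so they fall into one period, while the third year starts a new one. A real corpus would need proper tokenization and thresholds chosen for the data.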

Aug 11th, 12:00 AM

Corpus Periodization Framework to Periodize a Temporally Ordered Text Corpus
