Information retrieval can be likened to a mining process. Searchers drill through a document space using keywords to extract document subsets. These subsets must then be reviewed to separate the topically relevant documents from the irrelevant ones. Searchers learn interactively from the relevance of the retrieved subsets and resubmit search arguments, either to narrow the search and obtain more relevant documents or to broaden it and improve the likelihood of recalling the highest percentage of relevant documents from the document space. Searchers may be aided by a document organization scheme that human categorizers use to organize the document space. Schemes such as the Library of Congress Classification System (LCCS) tend to be rigid and dated. Documents are greatly increasing in number, and organizational schemes such as the LCCS are not adapting well to the varying content of the books and documents being added to the document space. What is needed is an automatic mapping tool that 1) takes the document space as it is, 2) creates a conceptual map of the space, and 3) clusters like documents and places them together on the map. This research (in progress) is an attempt to determine the value of the Kohonen Self-Organizing Map (SOM) (Kohonen, 1995) for use as an interactive textual data mining tool for categorization of large sets of documents. The SOM algorithm analyzed 339 Management Information Systems Quarterly (MISQ) abstracts from 1985 to 1997. The first analysis resulted in a map of two major regions: Information and Systems. This demonstrated that the SOM was working correctly but produced a potentially uninteresting map. What may be more interesting is the next level of conceptual detail, i.e., the major conceptual areas of the MISQ document space below this high level of abstraction.
To obtain this map, "management," "information," and "systems" were added to a stop-word list, and the Kohonen algorithm was reapplied to obtain a mapping of the MISQ literature at this second level of detail below Management Information Systems. At both levels of abstraction, the 339 abstracts were partitioned among the conceptual regions. This suggests the possibility of an interactive tool that aids searchers in exploring large document spaces through a "divide and conquer" approach to information retrieval, whereby the tool clusters similar documents into topical regions of the map for exploratory browsing.
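The two-pass procedure described above (map the documents, add the dominant terms to the stop-word list, then map again) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words weighting, the 2x2 grid, the learning-rate and radius schedules, the base stop-word list, and the four toy "abstracts" are all assumptions chosen only to keep the example small and self-contained.

```python
import math
import random
from collections import Counter

# Hypothetical base stop-word list; the paper does not give its list.
STOP_WORDS = {"the", "of", "and", "a", "in", "for", "to", "on"}


def vectorize(docs, stop_words):
    """Turn each document into a unit-length bag-of-words vector."""
    tokenized = [[w for w in d.lower().split() if w not in stop_words]
                 for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def best_matching_unit(weights, vec):
    """Return the grid node whose weight vector is closest to vec."""
    return min(weights,
               key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], vec)))


def train_som(vectors, rows, cols, epochs=200, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        frac = t / epochs
        lr = 0.5 * (1 - frac)                            # assumed linear decay
        radius = max(rows, cols) / 2 * (1 - frac) + 0.5  # assumed schedule
        for vec in rng.sample(vectors, len(vectors)):    # shuffled pass
            bmu = best_matching_unit(weights, vec)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (vec[i] - w[i])
    return weights


def map_documents(weights, vectors):
    """Partition document indices among map nodes by best matching unit."""
    regions = {}
    for idx, vec in enumerate(vectors):
        regions.setdefault(best_matching_unit(weights, vec), []).append(idx)
    return regions


# Invented toy documents, standing in for real abstracts.
docs = [
    "management information systems for group decision support",
    "group decision support systems in meetings",
    "information systems outsourcing contracts",
    "outsourcing management and vendor contracts",
]

# First level: map the documents with only the base stop words removed.
vocab, vectors = vectorize(docs, STOP_WORDS)
first_level = map_documents(train_som(vectors, rows=2, cols=2), vectors)

# Second level: suppress the dominant terms and map again.
second_stop = STOP_WORDS | {"management", "information", "systems"}
vocab2, vectors2 = vectorize(docs, second_stop)
second_level = map_documents(train_som(vectors2, rows=2, cols=2), vectors2)

print(first_level)
print(second_level)
```

Each returned dictionary maps a grid node to the indices of the documents assigned to it, so every document falls in exactly one region at each level. A browsing tool along the lines suggested above could repeat the second step within a chosen region to drill down further.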
Orwig, Richard and Pendergast, Mark, "Holistic Information Retrieval Through Textual Data Mining" (2000). AMCIS 2000 Proceedings. 188.