As data and information quality research evolves toward a unified body of knowledge, defining the identity of this research area has become increasingly important. This paper presents the results of a preliminary study that aims to help define this identity in terms of its core topics and themes. To do so, we analyze the abstracts of 324 journal articles and conference proceedings papers published over the past ten years. Latent semantic analysis is applied to these abstracts to derive term-to-term semantic similarities and term-to-factor loadings, from which six core topics and fifteen core themes of data quality research are identified.
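
As a rough illustration of the latent semantic analysis step, the sketch below builds a term-by-document matrix from a handful of toy abstracts, factors it with a truncated singular value decomposition, and derives term-to-factor loadings and term-to-term cosine similarities. It is a minimal sketch under stated assumptions, not the procedure used in the study: the toy abstracts, the TF-IDF weighting, the choice of two factors, and all function names are hypothetical.

```python
import re
import numpy as np

# Toy stand-ins for abstracts (hypothetical; not the study's corpus).
abstracts = [
    "data quality assessment framework for data warehouses",
    "information quality assessment in healthcare information systems",
    "record linkage and entity resolution for data cleansing",
    "entity resolution methods improve record linkage accuracy",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return the sorted vocabulary and a terms x documents count matrix."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    index = {t: i for i, t in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for token in tokenize(doc):
            counts[index[token], j] += 1
    return vocab, counts


def tfidf(counts):
    """Weight raw counts by inverse document frequency (an assumed weighting)."""
    doc_freq = (counts > 0).sum(axis=1)
    idf = np.log(counts.shape[1] / doc_freq) + 1.0
    return counts * idf[:, None]


def term_factor_loadings(matrix, k):
    """Truncated SVD: each term's loadings on the first k latent factors."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]


def term_similarity(loadings):
    """Cosine similarity between terms in the reduced factor space."""
    norms = np.linalg.norm(loadings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = loadings / norms
    return unit @ unit.T


vocab, counts = term_document_matrix(abstracts)
loadings = term_factor_loadings(tfidf(counts), k=2)  # k=2 is illustrative only
similarity = term_similarity(loadings)

# For each factor, list the three terms with the largest absolute loadings.
for f in range(loadings.shape[1]):
    top = np.argsort(-np.abs(loadings[:, f]))[:3]
    print(f"factor {f}:", [vocab[i] for i in top])
```

In a sketch like this, terms that load heavily on the same factor are candidates for a shared topic, and the similarity matrix shows which terms tend to occur in similar contexts. The number of factors retained is a modeling choice that the toy value of two does not reflect.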

This paper presents a quantitative, reproducible method for identifying topics and themes. In further research, the method will be used to measure how frequently papers are published in each topic and theme, indicating their level of activity. Applying the method to publications within discrete time periods can show how topics and themes change and evolve. This research has the potential to help define the identity of data and information quality research, identify the topics and themes receiving the greatest attention, and reveal trends in this growing area.