Document similarity is an important concept for many research questions. It can be applied to trace information exchanged in capital markets. To compute similarity, each document must first be transformed into a vector (its document representation), and researchers can choose from a variety of such representations. We review the finance and accounting literature and find many different practices for estimating document similarity but little guidance on how to choose the right approach. To address this gap, we propose a framework of three similarity dimensions (object, author, and time). Based on this framework, we conduct an experiment on a corpus of analyst reports to quantify the accuracy of the estimated similarity. Doc2vec achieves the highest overall accuracy, while Latent Dirichlet Allocation performs well on the object dimension. Bag-of-words models achieve surprisingly promising results despite their simplicity. Our results help researchers and practitioners choose an appropriate document representation for their analysis.
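
To make the two steps concrete (representing each document as a vector, then comparing the vectors), the following is a minimal sketch of a bag-of-words representation with cosine similarity. It is an illustration only and not the experimental setup used in this study: the tokenizer, the function names `bag_of_words` and `cosine_similarity`, the choice of cosine as the similarity measure, and the three example sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a document as a sparse vector of lowercase term counts.

    The simple regex tokenizer is an assumption for illustration.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    # Terms missing from b count as zero (Counter returns 0 for absent keys).
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical example texts, not taken from the analyst report corpus.
report_a = "We raise our price target as earnings beat expectations"
report_b = "Earnings beat expectations so we raise our price target"
report_c = "The central bank left interest rates unchanged"

vec_a, vec_b, vec_c = (bag_of_words(t) for t in (report_a, report_b, report_c))
print(cosine_similarity(vec_a, vec_b))  # high: the reports share most terms
print(cosine_similarity(vec_a, vec_c))  # zero: the reports share no terms
```

Because a bag-of-words vector ignores word order, the first two example texts score as highly similar even though their sentences are arranged differently. Representations such as Latent Dirichlet Allocation or Doc2vec would replace `bag_of_words` with a learned, dense vector while the comparison step could stay the same.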



