Abstract

The rise of the social internet has led to an explosion of user-generated content (UGC). UGC can take a number of forms: social media posts, wiki contributions, and product reviews, among many others. The defining feature of UGC is that it originates from a user of the information system rather than from the firm that controls it (Yi et al. 2019). Prior research has explored UGC as a strategy to induce user effort (Goes et al. 2016), as a vector for gathering information (Lukyanenko et al. 2019), or as a method to understand customer behavior (Yang et al. 2019). We seek to extend the utility of UGC by exploring its use as a basis for categorization with well-known methods (Bag-of-Words, Word2Vec), and we conclude by proposing a classification scheme based on a well-validated personality battery as an exercise in comparative efficacy. We use a common context as an exemplar: a set of games on Steam. This data set has a particular advantage: we can use unrelated shocks to make inferences about psychological needs satisfaction from use patterns. We then attempt to classify games based on the content of their reviews. We use four classification methods: a naive bag-of-words model, a bag-of-words model enhanced with Word2Vec, an LLM classifier assisted by retrieval-augmented generation, and zero-shot classification using calculated embeddings. We find that, despite the comparative crudeness of the Word2Vec model, a Word2Vec-informed bag-of-words classification achieves better quantitative results than off-the-shelf LLM products while offering qualitative insights into domain-specific vernacular that may be overlooked by non-trained methods.
