Location

Online

Event Website

https://hicss.hawaii.edu/

Start Date

January 3, 2023 12:00 AM

End Date

January 7, 2023 12:00 AM

Description

A large part of knowledge evolves outside of the operations of an organization. Question-and-answer online social platforms provide an important source of information for exploring the underlying communities. StackOverflow (SO) is one of the most popular question-and-answer platforms for developers, with more than 23 million questions asked. Organizing and categorizing data is crucial to managing knowledge in such large quantities. Questions posted on SO are assigned a set of tags, and the textual content of each question may contain code syntax. In this paper, we evaluate the performance of multiple text representation methods on the task of predicting tags for SO questions and empirically demonstrate the impact of code syntax on text representations. The SO dataset was sampled, and questions without code syntax were identified. Two classical text representation methods, bag-of-words (BoW) and TF-IDF, were selected along with four methods based on pre-trained models: fastText, USE, Sentence-BERT, and Sentence-RoBERTa. A multi-label k-Nearest Neighbors classifier was used to learn and predict tags based on the similarities between feature-vector representations of the input data. Our results indicate a consistent superiority of the representations generated by Sentence-RoBERTa. Overall, the classifier achieved a 17% or higher improvement in F1 score when predicting tags for questions without any code syntax in their content.
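
The abstract outlines a pipeline in which sentence-embedding representations of question text are fed to a multi-label k-Nearest Neighbors classifier. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the sentence-transformers and scikit-learn libraries, an illustrative Sentence-RoBERTa checkpoint ("all-distilroberta-v1"), and a toy set of questions and tags.

# Sketch: embed SO-style question text with a Sentence-RoBERTa model and
# predict tags with a multi-label kNN classifier. The model name, questions,
# and tags below are illustrative assumptions, not taken from the paper.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MultiLabelBinarizer

questions = [
    "How do I merge two dictionaries in Python?",
    "What is the difference between let and var in JavaScript?",
    "How can I join two tables in SQL?",
]
tags = [["python", "dictionary"], ["javascript"], ["sql", "join"]]

# Encode each question into a fixed-size feature vector
encoder = SentenceTransformer("all-distilroberta-v1")
X = encoder.encode(questions)

# Turn each tag set into a binary indicator row, one column per tag
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# scikit-learn's kNN classifier handles multi-label targets; neighbors are
# found by cosine distance between the sentence embeddings
clf = KNeighborsClassifier(n_neighbors=1, metric="cosine")
clf.fit(X, Y)

pred = clf.predict(encoder.encode(["How do I sort a list of dicts in Python?"]))
print(mlb.inverse_transform(pred))  # e.g. [('dictionary', 'python')]

A per-question F1 score over predicted and true tag sets (for example, sklearn.metrics.f1_score with average="samples") would correspond to the multi-label F1 comparison described above.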

Assessing Text Representation Methods on Tag Prediction Task for StackOverflow

https://aisel.aisnet.org/hicss-56/cl/text_analytics/4