NEAIS 2024 Proceedings

Reliability and Agreement Comparison of ChatGPT-3.5 and ChatGPT-4

Ana Merlo, Worcester Polytechnic Institute
Roshni Harish, Worcester Polytechnic Institute
Adrienne Hall-Phillips, Worcester Polytechnic Institute

Abstract

Sentiment analysis is a rapidly growing field that uses text mining to identify positive, negative, or neutral sentiments in reviews, articles, and social media data. However, analyzing social media can be challenging due to the presence of emojis, slang, and acronyms, so it is essential to understand the context before assigning sentiments to the content. Considering the insights social media data can provide, and building on one of our previous research projects on body positivity, this study uses comments from Instagram body positivity posts to conduct sentiment analysis using LLM tools. While LLMs are powerful tools for understanding the nuances of comments, some studies have already demonstrated that GPT-3.5 can have an inherent bias related to gender stereotypes, religion, demographic groups, or gender identities. Underlying tones such as sarcasm, weight bias, body shaming, eating disorders, or body stigmatization cannot be easily detected by traditional methods. Traditional sentiment analysis methods, which rely on basic machine learning algorithms and static lexicons, contrast with more advanced techniques powered by deep learning and AI, such as those used by ChatGPT. Research on the use of generative AI to conduct analysis remains limited, and research on the application of ChatGPT to sentiment analysis of social media content is scarcer still. Very few studies have compared two different generative AI models for any type of analysis. This study addresses that gap by exploring the sentiment evaluations of ChatGPT 3.5 and 4.0 and the reliability and agreement of both models in classifying sentiments from Instagram comments tagged with #bodypositivity. Using Cohen’s Kappa and Krippendorff’s Alpha, we evaluate consistency and accuracy, revealing key discrepancies, especially in neutral sentiment classification.
Through thematic sentiment analysis, we assess how each model handles these discrepancies when classifying complex issues like body appreciation, stigma, and mental health. Findings will add to the growing literature on using generative AI for analysis, while also providing insights into the bias present in data of this type, particularly how both models handle societal norms about appearance-based judgments and body acceptance. Further, findings may point to a more efficient way to conduct sentiment analysis on nuanced and sensitive social media data.
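
To make the two agreement statistics named in the abstract concrete, the following is a minimal sketch of how they could be computed for two sets of nominal sentiment labels. It assumes each model is treated as one rater, every comment receives exactly one label from each model, and there is no missing data. The abstract does not describe the authors' actual procedure, and the function names and toy labels below are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters assigning nominal labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items on which both raters chose the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's own marginal label proportions.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1 - p_e)


def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha, nominal metric, two raters, no missing values."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = 2 * len(a)                    # total number of pairable values
    pooled = Counter(a) + Counter(b)  # label frequencies pooled over both raters
    # Each disagreeing item contributes two ordered mismatching pairs.
    observed = 2 * sum(x != y for x, y in zip(a, b))
    # Sum over c != k of n_c * n_k, written as n^2 minus the sum of squares.
    expected = n * n - sum(c * c for c in pooled.values())
    if expected == 0:
        return 1.0  # only one label appears anywhere, so no disagreement is possible
    return 1 - (n - 1) * observed / expected


# Hypothetical toy labels (pos = positive, neg = negative, neu = neutral).
labels_model_a = ["pos", "pos", "neg", "neu", "pos", "neg", "neu", "neu", "pos", "neg"]
labels_model_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neu", "neg", "pos", "neg"]

kappa = cohen_kappa(labels_model_a, labels_model_b)
alpha = krippendorff_alpha_nominal(labels_model_a, labels_model_b)
print(f"Cohen's kappa: {kappa:.3f}")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

In this toy data both disagreements involve a neutral label from one rater, which is the kind of pattern the abstract highlights. Kappa estimates chance agreement from each rater's own label distribution, whereas alpha pools the labels of both raters, so the two values can differ slightly on the same data.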

Recommended Citation

Merlo, Ana; Harish, Roshni; and Hall-Phillips, Adrienne, "Reliability and Agreement Comparison of ChatGPT-3.5 and ChatGPT-4" (2024). NEAIS 2024 Proceedings. 17.
https://aisel.aisnet.org/neais2024/17
