Abstract

Misinformation – content that is false, but whose motivation for deception is uncertain – on social media during a health pandemic presents a major concern for public health. Recently, the vast volume of news and information around COVID-19, which the World Health Organization has termed an "infodemic," has led to an unprecedented increase in health misinformation woven into the online narrative about the pandemic. Online narratives, particularly on social media platforms, are critical objects of inquiry because narratives are fundamental to how people construct socially shared belief systems, and they can be a primary means of spreading health misinformation online. In the case of COVID-19 specifically, false social media narratives about the virus's origin or about unapproved and untested remedies can influence public health attitudes and behavior, potentially costing billions of dollars and numerous lives. While social media sites assume responsibility for moderating their platforms toward more meaningful and trustworthy content (see, for instance, the joint statement from Facebook, Twitter, Google, YouTube, and others on combating fraud and misinformation about COVID-19), fact-checking organizations around the world are also devoting their efforts to publishing evidence-based analyses of online narratives to convince audiences of their inauthenticity. However, there is a lack of research on how health misinformation could affect individuals' attitudes toward the pandemic.

To that end, in this study we develop a typology of health misinformation as a lens on individuals' health behaviors, and we then use the typology to train a classifier that predicts the misinformation category of new data. We collected COVID-19 misinformation data from six fact-checking websites (snopes.com, politifact.com, factcheck.org, leadstories.com, factcheck.afp.com, and poynter.com) over eight months (January 2020 to August 2020). We wrote Python scripts using the well-known libraries Beautiful Soup, for web crawling, and pandas, for data analysis. We then used the Latent Dirichlet Allocation (LDA) algorithm to identify the main topics discussed in the 8,162 fact-checked articles, obtaining a topic model with 40 topics. By analyzing each topic, its top-ten associated words, and the associated fact-checked articles, we identified 13 higher-level themes, or categories, of COVID-19 health misinformation.

These categories served as class labels for a classification model built with scikit-learn. Its feature_extraction module converts text into numerical features by tokenizing the strings and assigning each token an integer ID; we counted the occurrences of tokens in each document and then normalized and weighted the counts so that tokens common across documents carry diminishing importance, using the CountVectorizer and TfidfTransformer functions. We used scikit-learn's grid-search implementation with k-fold cross-validation to find the best parameter settings and obtain a less biased model. Based on an extensive literature review, we selected three classification models: logistic regression, k-nearest neighbors, and a multiclass support vector machine. After training the three models, we evaluated them with four performance metrics: accuracy, precision, recall, and F1-score.
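To make the pipeline above concrete, the following sketches illustrate each step. First, the data-collection step: this is a minimal crawling sketch in the spirit of the scripts described above, where the listing URL and the CSS selectors are hypothetical placeholders, since each of the six fact-checking sites requires its own parsing logic.

```python
# Minimal crawling sketch. The listing URL and selectors are placeholders;
# each fact-checking site needs its own parsing logic and polite crawling.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_listing(url):
    """Fetch a fact-check listing page and collect article titles and links."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("article"):          # placeholder selector
        link = item.find("a")
        if link and link.get("href"):
            rows.append({"title": link.get_text(strip=True), "url": link["href"]})
    return pd.DataFrame(rows)

articles = scrape_listing("https://www.snopes.com/fact-check/")  # example page
print(articles.head())
```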
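Next, the topic-modeling step. The abstract does not name a specific LDA implementation, so the use of scikit-learn's LatentDirichletAllocation below is an assumption; `docs` stands in for the 8,162 fact-checked article texts, and `n_components=40` matches the reported topic count.

```python
# Topic-modeling sketch; scikit-learn's LDA is assumed, and `docs` stands
# in for the fact-checked article texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
doc_term = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=40, random_state=0)  # 40 topics, as reported
lda.fit(doc_term)

# Print the top-ten words per topic, mirroring the manual theme analysis.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {k}: {', '.join(top_words)}")
```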
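Finally, the classification pipeline, chaining the CountVectorizer and TfidfTransformer steps named above into a grid-searched model. Here `docs` and `labels` are assumed to hold the article texts and their 13 categories, and the parameter grid is illustrative, as the abstract does not list the values searched.

```python
# Classification-pipeline sketch; `docs` and `labels` are assumed inputs,
# and the parameter grid is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("counts", CountVectorizer()),   # tokenize and count token occurrences
    ("tfidf", TfidfTransformer()),   # normalize; down-weight common tokens
    ("clf", LinearSVC()),            # one-vs-rest multiclass linear SVM
])

grid = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1, 10]}, cv=5)  # k=5 folds
grid.fit(docs, labels)
print(grid.best_params_)
```

Swapping LinearSVC for LogisticRegression or KNeighborsClassifier in the final step yields the other two models compared in the study.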
The performance evaluation results suggest that the Multiclass Support Vector Machine (MSVM)-based classifier achieved high performance, with 88% accuracy, 85% precision, 83% recall, and an 82% F1-score.
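For reference, the four reported metrics can be computed on a held-out split as follows; `y_test` and `y_pred` are assumed to come from a train/test split of the labeled articles, and macro averaging over the 13 categories is an assumption, since the abstract does not state the averaging scheme.

```python
# Computing the four reported metrics; `y_test`/`y_pred` and the macro
# averaging scheme are assumptions not stated in the abstract.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1-score :", f1_score(y_test, y_pred, average="macro"))
```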
