
Text Analytics on Twitter Text-based Public Sentiment for Covid-19 Vaccine: A Machine Learning Approach

 


Introduction

According to a Statista infographic from October 2020, there are 4.66 billion active internet users, 4.28 billion unique mobile internet users, 4.14 billion active social media users and 4.08 billion active mobile social media users [1], who deposit huge amounts of data on social media platforms such as Facebook, Twitter, Instagram, WhatsApp and others. Many activities take place on social media, covering business, political issues and analysis, national economic issues, healthcare, education and disaster management [2]. By applying text mining approaches, this huge amount of unstructured data can be converted to a clean, structured format [3]. This study chooses Twitter as the primary object of sentiment analysis because it is one of the most popular microblogging platforms, with 140 million users posting more than 400 million tweets every day. In this context, mining sentiment data efficiently will help in better understanding public opinion and sentiment, especially regarding the recent Covid-19 vaccines.

Covid-19 is a disastrous disease that started spreading among people in December 2019 in Wuhan, China, and soon began to draw public attention. As a result of this pandemic, many research institutions and companies worldwide are working round the clock to create Covid-19 vaccines. According to the Regulatory Affairs Professionals Society (RAPS), as of 30 November 2020 there are 51 Covid-19 vaccines currently undergoing clinical testing, of which 10 have already reached the last stage of clinical trials (Phase 3), while the remaining 41 are at stages ranging from pre-clinical through Phase 1 and Phase 2 trials. Several authors have conducted research in text analytics on Twitter data about the Covid-19 pandemic [4 – 7]. However, to the best of our knowledge, public sentiment on Covid-19 vaccines is still less explored. This study focuses on exploring this aspect.

Nowadays, people share their opinions and sentiments on social media, especially Twitter, as it is one of the most popular microblogs and allows users to post short messages. Sentiment analysis is an application of natural language processing (NLP) that requires pre-processing, especially for social media data such as Twitter data, which is always noisy and contains many characters and symbols that may not be relevant to a given sentiment analysis task. Several researchers recommend pre-processing text data before feeding it into machine learning models, since it helps classification and evaluation yield good results; such studies include [8 – 12].

To address the challenges of noisy text data, this study employs various pre-processing techniques to remove unwanted data and thereby improve the classification and evaluation accuracy of the models. The unwanted data in this study include URLs, hashtags, punctuation, stop words and repeated words, among others. Likewise, this study aims to classify public sentiment regarding the Covid-19 vaccine. The following research questions will be answered:

R.Q.1. How can pre-processing of Twitter data and machine learning algorithms help in detecting public emotion regarding the Covid-19 vaccine?

R.Q.2. How does the performance of machine learning algorithms vary in classifying aggregated public emotions on the Covid-19 vaccine?


The objectives of the study are derived from the above problem statement and research questions. The main aim and objectives of the study are:

1.      To collect Twitter data regarding public sentiment on Covid-19 vaccine.

2.      To clean the datasets by applying different pre-processing techniques such as stop word removal, hashtag removal, lemmatization, etc.

3.      To classify and evaluate the sentiment of the collected Twitter data using Support Vector Machine and K-Nearest Neighbour.
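The cleaning step in objective 2 can be illustrated with a minimal sketch that uses only the Python standard library. The function name `clean_tweet`, the regular expressions and the tiny stop word list are assumptions made for illustration and are not taken from the study. Lemmatization is left out because it normally relies on an external NLP library.

```python
import re
import string
from typing import List

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "it", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text: str) -> List[str]:
    """Return the cleaned tokens of a single tweet."""
    text = text.lower()
    text = URL_RE.sub(" ", text)                 # drop URLs
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)  # drop @mentions and #hashtags
    text = text.translate(PUNCT_TABLE)           # "safe!!!" becomes "safe   "
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    # Collapse immediately repeated words ("very very good" -> "very good").
    deduped: List[str] = []
    for tok in tokens:
        if not deduped or deduped[-1] != tok:
            deduped.append(tok)
    return deduped


print(clean_tweet("The vaccine is is SAFE!!! https://t.co/abc #Covid19 @who very very good"))
# -> ['vaccine', 'safe', 'very', 'good']
```

Whether hashtags are removed entirely, as here, or only stripped of the `#` symbol is a design choice. The sketch removes them because the text lists hashtags among the unwanted data.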

 

Literature Review

According to Nassirtoussi et al. [13], a supervised machine learning approach is a model that accepts input data and predicts the output using regression or classification techniques. Regression techniques are used to quantify continuous responses, e.g. changes in market values, variations in temperature or variations in energy demand, while classification techniques are used for language identification, sentiment analysis and applications in the medical domain. They further state that in the unsupervised machine learning approach, the algorithm scans stored templates through input data and is usually used for object recognition, sequential analysis, and so on.

The study in [14] highlights the difference between two machine learning approaches, supervised and unsupervised learning. Supervised learning uses a labeled dataset collected from different sources, which contains a variety of patterns and information to guide the machine, while unsupervised learning trains the machine without guidance from the user: it identifies groups of data based on their hidden similarities and features with the help of data clustering techniques and association rules.

Mujtaba et al. [15] present five machine learning classification techniques, namely supervised learning, semi-supervised learning, unsupervised learning, statistical learning and content-based learning. They further explain that the supervised machine learning approach is the most popular and common approach used by researchers around the globe, and that SVM and KNN are among the best standard machine learning algorithms, widely used in different research domains.

 

Fig. 1. Machine Learning Based Text Analytics and Sentiment Analysis System Overview.

Several sentiment analysis and emotion classification studies have been conducted on text from social media using standard machine learning algorithms (Naïve Bayes, Support Vector Machine, Random Forest, Logistic Regression, K-Nearest Neighbor, and Decision Tree) that have proven to be efficient in text classification, as illustrated by [16 - 18].

Figure 1 shows the steps for text analytics in the context of machine learning classification. Programming languages such as Python, R and others are used to collect data from social media such as Twitter. After the data are collected, pre-processing techniques are applied to transform the unstructured data into structured data by removing all unwanted characters before the data are fed into a machine learning model for classification.
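One common way to turn cleaned tweets into structured input for a classifier is a bag-of-words count vector. The text does not state which feature representation the study uses, so the sketch below is only an assumption, and the helper names `build_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(token_lists: List[List[str]]) -> Dict[str, int]:
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def to_count_vector(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Count occurrences of each vocabulary token; unknown tokens are ignored."""
    counts = Counter(tokens)
    # The dict keeps insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocab]


# Two already-cleaned, tokenized tweets (made-up examples).
docs = [["vaccine", "safe"], ["vaccine", "scared", "scared"]]
vocab = build_vocabulary(docs)
matrix = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)   # -> {'safe': 0, 'scared': 1, 'vaccine': 2}
print(matrix)  # -> [[1, 0, 1], [0, 2, 1]]
```

Each row of `matrix` has the same length, which is what makes the originally unstructured text usable as input to a classification model.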

To achieve the aims and objectives of the study, two standard machine learning models were employed: Support Vector Machine (SVM) and K-Nearest Neighbour (KNN).
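To make the KNN idea concrete, the following is a minimal standard-library sketch of a K-Nearest Neighbour sentiment classifier over token counts. It is not the study's implementation: the cosine similarity measure, the value `k = 3`, the function names and the toy training tweets are all assumptions chosen for illustration. An SVM is not sketched here, since in practice it would usually come from a machine learning library rather than be written by hand.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train: List[Tuple[List[str], str]], tokens: List[str], k: int = 3) -> str:
    """Label a tokenized tweet by majority vote among its k most similar training tweets."""
    query = Counter(tokens)
    ranked = sorted(
        train,
        key=lambda example: cosine_similarity(Counter(example[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-cleaned training tweets with made-up labels.
TRAIN = [
    (["vaccine", "safe", "effective"], "positive"),
    (["happy", "vaccinated", "grateful"], "positive"),
    (["vaccine", "works", "relief"], "positive"),
    (["vaccine", "dangerous", "scared"], "negative"),
    (["side", "effects", "worried"], "negative"),
    (["refuse", "vaccine", "unsafe"], "negative"),
]

print(knn_predict(TRAIN, ["side", "effects", "scared", "worried"]))  # -> negative
print(knn_predict(TRAIN, ["happy", "grateful", "safe", "effective"]))  # -> positive
```

KNN stores the training examples and defers all computation to prediction time, so its cost grows with the size of the training set.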

 

