Python 2451 Assignment - Digital Analytics - Individual Research Report - University of Queensland
Digital Analytics Introduction
With the current Covid-19 crisis, political leaders and policymakers are continuously interacting with the public and the media via platforms such as Facebook, Twitter, and video conferencing. The main focus of this assignment is to use Twitter and its API to extract the text of posts made by three accounts: Australia’s prime minister (@ScottMorrisonMP), the minister for health (@greghuntmp), and the health department (@healthgovau). This report covers the scraping of that data and some visualizations of the posts from these three accounts.
It is important for these organizations to keep the public informed about the current situation. Through social media they share information such as the number of new cases, the number of infected people who have recovered, and the number of deaths. They also see to it that proper guidance is given, such as which areas have the most cases and which important sectors should keep operating in those areas.
It is also important to notice how they interact and what tone they use. Some of the key points mentioned in the assignment relate to categorization based on factuality, emotionality, and locus of responsibility. More details are given in later sections.
New policies are being implemented to manage the crisis and to prevent an economic depression. Some policies relate to banking and loans, while others affect consumers. This report also looks at how the communication evolved in later stages. Different data visualization techniques are used, which also illustrates the importance of data cleaning and data analysis.
Methodology – Python Programming
Python offers a simple library, twitter_scraper, which uses an API provided by Twitter to return results for a chosen query. Calling the get_tweets() function with a handle or tag as its argument returns the tweets posted under it. Each tweet is a dictionary containing different keys, among them user_id, tweet_username, text, video_url, likes, retweets, replies, country, type, and handle.
Using Python's ability to compare dates and times, only tweets whose timestamp falls after 2019/12/1 are selected. Each such tweet is appended to a list; the list is later converted to a DataFrame using the pandas library and stored in an .xlsx file with the to_excel() function.
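The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used for the assignment: it assumes each tweet is a dictionary with a "time" key holding a datetime, the sample tweets are made up, and the cutoff is a parameter set here to the reference date of the ‘days-since-dec1’ column.

```python
from datetime import datetime

import pandas as pd

# Assumption: the cutoff is configurable; this value is only illustrative.
CUTOFF = datetime(2019, 12, 1)


def tweets_after(tweets, cutoff=CUTOFF):
    """Return the tweets whose timestamp is strictly after the cutoff."""
    return [t for t in tweets if t["time"] > cutoff]


# Hypothetical sample standing in for the tweets returned by the scraper.
sample_tweets = [
    {"time": datetime(2019, 11, 20, 9, 0), "text": "Flu season update"},
    {"time": datetime(2020, 3, 15, 14, 30), "text": "New covid cases today"},
    {"time": datetime(2020, 4, 2, 8, 15), "text": "Stay inside this weekend"},
]

# Keep only the recent tweets and turn the list of dicts into a DataFrame.
df = pd.DataFrame(tweets_after(sample_tweets))

# Writing to Excel needs an Excel writer such as openpyxl to be installed:
# df.to_excel("tweets.xlsx", index=False)
```

Comparing datetime objects directly avoids any string parsing, which is why the timestamps are kept as datetimes until the data is written out.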
The given tweet data (twitterdata.xlsx) contains a total of 1590 tweets. Categorized by account, 262 tweets are from the handle ScottMorrisonMP, 428 from greghuntmp, and 900 from healthgovau. Harvesting this data from Twitter takes about a minute or less. Each tweet has 57 columns containing different information and metadata about the tweet.
A new variable was then added to the DataFrame built from the data file given to us. The text of every tweet is checked for the keywords ‘corona’, ‘virus’, and ‘covid’. If a tweet contains any of these words, it counts as a useful tweet for our visualization, and ‘yes’ is stored in the new variable for that tweet; otherwise ‘no’ is stored.
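A minimal sketch of this flagging step is given below. The column names "text" and "covid_related" and the sample rows are assumptions made for illustration, and matching is done case-insensitively, which the report does not specify.

```python
import pandas as pd

KEYWORDS = ("corona", "virus", "covid")


def covid_flag(text):
    """Return 'yes' if the text mentions any keyword (ignoring case), else 'no'."""
    lowered = str(text).lower()
    return "yes" if any(word in lowered for word in KEYWORDS) else "no"


# Hypothetical rows; the real data comes from twitterdata.xlsx.
tweets = pd.DataFrame({"text": [
    "COVID-19 update: new cases reported",
    "Happy Australia Day to everyone",
    "Wash your hands to slow the virus",
]})

# Store the flag in a new column, one value per tweet.
tweets["covid_related"] = tweets["text"].apply(covid_flag)
```

Wrapping the value in str() keeps the function from failing on rows where the text cell is empty.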
The new DataFrame was split into three subsets, for ScottMorrisonMP, greghuntmp, and healthgovau, using pandas functionality. Then 50 tweets were randomly collected from each subset with the sample() function, making a total of 150 tweets. The given data already has a column called ‘days-since-dec1’, which gives the number of days relative to 2019/12/1.
Grouping each subset by this column gives the tweets made on each day, and the count() function returns the number of tweets in each group. Doing this for all three subsets and plotting the result as a time series gives Figure 1.
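The sampling and grouping steps might look like the sketch below. The data here is generated purely for illustration, "handle" and "text" are assumed column names, and the fixed random_state is an addition so that the sample is repeatable.

```python
import pandas as pd

# Hypothetical stand-in for twitterdata.xlsx: 60 tweets per handle,
# spread over ten days. Only "days-since-dec1" is named in the report.
rows = []
for handle in ("ScottMorrisonMP", "greghuntmp", "healthgovau"):
    for i in range(60):
        rows.append({"handle": handle, "text": f"tweet {i}", "days-since-dec1": i % 10})
data = pd.DataFrame(rows)

daily_counts = {}
for handle, subset in data.groupby("handle"):
    # Draw 50 random tweets from this handle's subset.
    sampled = subset.sample(n=50, random_state=0)
    # Count the sampled tweets posted on each day.
    daily_counts[handle] = sampled.groupby("days-since-dec1")["text"].count()

# Each entry is a Series indexed by day; calling .plot() on it draws one line.
```

Days on which a handle has no sampled tweet are simply absent from its Series, which is one reason a curve drawn from only 50 tweets looks jagged.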
The second visualization detects the emotional content of each tweet's text. After randomly drawing the 150 tweets as described above, each tweet is searched for a few words that give an idea of its emotionality: threat, reassurance, or neutral. The result is shown in Figure 2. The threat words were chosen from [“don’t do”, “then”, “negative”], the reassurance words from [“danger”, “gatherings”, “positive”, “limited”], and the remaining tweets are neutral by default. More on this in the discussion section.
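A keyword search of this kind can be sketched as below, using the word lists given above. Two points are assumptions rather than part of the report: the threat list is checked before the reassurance list, and matching ignores case and treats curly and straight apostrophes alike.

```python
THREAT_WORDS = ["don't do", "then", "negative"]
REASSURANCE_WORDS = ["danger", "gatherings", "positive", "limited"]


def emotionality(text):
    """Label a tweet as 'threat', 'reassurance', or 'neutral' by keyword."""
    # Normalize case and curly apostrophes before searching.
    lowered = text.lower().replace("’", "'")
    if any(word in lowered for word in THREAT_WORDS):
        return "threat"  # assumption: threat takes precedence
    if any(word in lowered for word in REASSURANCE_WORDS):
        return "reassurance"
    return "neutral"  # default when no keyword matches


labels = [
    emotionality("Don’t do this at home"),
    emotionality("Gatherings are limited to ten people"),
    emotionality("Parliament sits tomorrow"),
]
```

Plain substring matching also fires inside longer words, so “then” matches “strengthen”. A word-boundary regular expression would be a stricter alternative.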
The third visualization identifies the factuality of each tweet: reporting current status, structural precautions, or individual precautions. After randomly drawing the 150 tweets as described above, each tweet is searched for a few words or phrases that give an idea of its factual content. The result is shown in Figure 3. The current status words were chosen from ["number", "casualties", "cured"], the structural precautions phrases from ["force closure", "closure", "social gathering", "limiting"], the individual precautions phrases from ["social distancing", "hand washing", "stay inside", "seek medical", "medical attention"], and the remaining tweets are none by default. More on this in the discussion section.
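The factuality counts behind a chart like Figure 3 could be tallied as in the sketch below. The phrase lists are the ones given above; the sample tweets are made up, and checking the categories in the listed order (so a tweet gets only the first matching category) is an assumption.

```python
from collections import Counter

# Categories are checked in this order; the first match wins (assumption).
FACTUALITY = {
    "current status": ["number", "casualties", "cured"],
    "structural precautions": ["force closure", "closure", "social gathering", "limiting"],
    "individual precautions": ["social distancing", "hand washing", "stay inside",
                               "seek medical", "medical attention"],
}


def factuality(text):
    """Return the first category whose phrases appear in the text, else 'none'."""
    lowered = text.lower()
    for category, phrases in FACTUALITY.items():
        if any(phrase in lowered for phrase in phrases):
            return category
    return "none"


# Hypothetical tweets used only to exercise the function.
sample = [
    "The number of new cases rose today",
    "Closure of pubs and clubs from midday",
    "Practise social distancing and hand washing",
    "New support package for small business",
]

# Tally how many tweets fall into each category, ready for a bar chart.
counts = Counter(factuality(t) for t in sample)
```

Keeping the phrase lists in one dictionary makes it easy to add a category or adjust the keywords without touching the classification logic.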
We can draw the following conclusions from Figure 1. The number of tweets in the last few weeks was higher than in the first few weeks. There was one high peak, which suggests that many new cases or policies were announced on that day. The curve is not smooth; one reason is the small size of the chosen data set. If the data size (here, the number of tweets) were in the thousands, we would expect a smoother curve.
We can also infer that on a normal day an account makes a single tweet covering all the status updates of that day, since one tweet a day is the most common value in Figure 1. On days when new rules or policies need to be shared, the frequency of tweets increases. Based on this, we can predict that if the number of tweets keeps increasing over time, the series will show more peaks, and higher ones.
The following inferences can be made from Figure 2. It is important that the emotional content of a tweet is expressed properly: some tweets need to read as a threat, some as reassurance, and others as neutral. The number of tweets classified as a threat (based on the few keywords mentioned above) needs to be low, as too many such tweets can cause unnecessary alarm in society. Most tweets should ideally remain neutral, and some tweets, say once a week, should reassure people that they are safe and that the rules and actions are taken with many positive points in view. Figure 2 shows the same pattern: neutral posts are the most numerous. However, a larger data set could provide more insight into these details.
Figure 3 shows the factuality of the tweets, that is, whether a tweet reports the current status or gives precautions for organizations (structural) or for individuals. These categories give an insight into which preventive measures to take and how the numbers are changing. Figure 3 shows the number of tweets reporting the current status as near 10. This is not quite the correct interpretation: because of the small number of tweets chosen and the randomness of the sample, the number of tweets reporting current status is just two. There are few tweets on structural precautions, and more focus is given to individual precautions.
Now let us discuss what the remaining tweets under None (in Figure 3) might be. Many of them concern economic policy, and some report on the situation outside the country. We have not focused on these points in this assignment. The given data could also be studied further to predict whether a tweet assigns internal or external responsibility. In short, this means telling whether a tweet points to other countries as the source of danger and the reason why our measures are not optimal or effective, or whether it takes responsibility for the current outcomes based on the decisions taken, saying that we should be more cautious, do more, and take more steps toward better outcomes.
This research gives a simple view of how data can be visualized and which techniques one should focus on. I think restricting the visualization to Covid tweets is itself a weakness, as it doesn’t provide much variety. With a larger data set such as the IMDB movies dataset, one could try many different methods and do more research, as this is an interesting topic. One idea would be to find the movie category that most of the audience rated.
As a start, this research provided many insights. I conclude by saying that in such hard times we should stay together, and we will win.