Is the adverse impact of COVID-19 more prevalent for a specific gender or age group?
Abstract
COVID-19, the highly contagious disease that has been spreading through the world like wildfire, is a black box. There is limited visibility on how the virus behind this disease acts or who it impacts more. Robust data analysis to answer some of the unknowns associated with the disease can help save many lives.
The purpose of this analysis is to determine whether the death rate for men above 50 years of age is higher than for other age groups and for women. Once this has been ascertained for confirmed cases of the disease around the world, the same can be determined specifically for Australia.
Publicly available demographic data on COVID-19 has been used for the analysis. The data primarily contains details on country of confirmed case, number of cases and the outcome i.e. death or recovery, along with demographic details such as gender and age.
Based on the analysis, infections among men outnumber those among women; further, men over 50 years of age have been more impacted than men under 50 years of age.
This category of the population is more at risk, and this finding can help plan resources and precautions better for them. The death rate is high at 5.8% as per the dataset. Also, the most commonly exhibited symptoms are fever and cough.
Introduction
The spread of the novel coronavirus disease, COVID-19, has taken the world by storm. Most economic activity is on hold, with people around the world surviving on essentials. The virus is believed to have jumped from bats to humans; it belongs to a class of coronaviruses common among wild animals. This unlikely animal-to-human transmission occurred in China last year. The first case was reported late in 2019 in Wuhan, which became the epicentre of this highly contagious disease. Because people continued to move in and out of Wuhan, travellers carried the disease around the world. Very little is known about these viruses. Doctors, scientists and data scientists are making every effort to learn how the virus behaves, whom it affects, how much risk it poses to an individual and how to treat those affected.
In the past month we have seen how this disease has severely strained the resources of countries like Italy and Spain. In such a scenario, it is important to know which part of the population is most at risk, so that the necessary preventive action can be taken and extra caution exercised. With the help of demographic data on COVID-19, we can assess whether a specific age group and gender is at greater risk. Several reports have already suggested that elderly men are more likely to suffer the adverse effects of the disease.
The role of data scientists in the current pandemic is huge. Almost 4 million people have been affected worldwide, and the medical fraternity is already working beyond what we could have considered humanly possible. The best way to defeat the virus is to find out whatever we can about it. The phrase ‘knowledge is power’ makes perfect sense in this situation.
Knowing which gender and age group is at greater risk can help a country plan its resources better. Extra precautionary measures can be prescribed for the identified population sets. Several past reports have indicated that elderly men are at a higher risk, which could be attributed to their lower immunity. Confirming this hypothesis can also help prioritize vaccines, once available, on logical grounds.
COVID-19 data is publicly available. All related data has been made openly accessible so that data scientists can work on it for novel insights. The dataset used here was taken from the Johns Hopkins GitHub repository. A random sample of 1085 records with 19 variables was chosen for analysis.
The data is compiled from the official reports published daily by nations around the world. Johns Hopkins publishes a dashboard of confirmed cases and outcomes, split by geographical area.
The 19 variables in the dataset are listed below.
Figure 1 Dataset variables
From the list of column names, we can determine the nature of the data. It contains a unique identifier for every row, the number of cases in a country, the date when the case was reported, a summary of the type of case, and the location and country of the case. Further, there are demographic details, i.e. gender and age. There are details on the dates of onset and exposure, and on whether the case was from Wuhan or had visited Wuhan. The outcome of the case is also available. The credibility of a case can be ascertained by looking at the source of the information.
Methods
Several steps were performed to arrive at the insights; these are described below. RStudio was installed with the help of the manual from CRAN (R Installation and Administration, 2020). The RStudio version used for the analysis, as reported by RStudio.Version(), is ‘1.1.383’.
A preliminary understanding of the data is important before starting any analysis. The data has been imported into a data frame using the read.csv command. The str command has been used to identify data types and example values in the data frame (str function, 2020).
Type conversion
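Text data is stored as factors. Because the symptom data is needed to identify the most common symptoms in patients, the type of this column has been changed to character using the lapply function (lapply function, 2020).
Age data has been converted from factor to numeric using the as.numeric function (Statistic Globe, 2020). The factor is converted to character first, because as.numeric applied directly to a factor returns the internal level codes rather than the recorded ages.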
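The difference between the two conversion routes can be seen in a minimal sketch. The ages below are hypothetical toy values, not records from the dataset, and the variable names are illustrative only.

```r
# Hypothetical toy ages stored as a factor, as read.csv did by default before R 4.0
age_factor <- factor(c("66", "56", "46", "60", "58"))

# Direct conversion returns the internal level codes, not the ages
codes <- as.numeric(age_factor)

# Converting to character first recovers the recorded values
ages <- as.numeric(as.character(age_factor))

print(codes)  # 5 2 1 4 3
print(ages)   # 66 56 46 60 58
```

The level codes follow the sorted order of the factor levels, which is why they bear no relation to the actual ages.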
Unstructured to structured data
One of the columns, ‘symptom’, holds free text. It has been cleaned to remove quotation marks, line breaks and other stray characters found in the text, for better analysis. This was done after loading additional libraries: dplyr, stringr, ggplot2 and tidytext (Text processing in R, 2020).
Data cleaning
The death rate can be computed from the death column. However, a date is recorded for cases where the outcome was death. The date is not required, so every value other than 0 has been replaced with a 1 by applying the condition mydata$death != "0". The column, stored as character, is then converted to numeric.
The number of deaths was found to be 63 out of 1085, i.e. a 5.8% death rate according to the chosen dataset.
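The cleaning step described above does not appear in the code listing at the end of the report, so the following is only a sketch of how it might look. The toy data frame and its values are hypothetical; only the condition on the death column comes from the text.

```r
# Hypothetical toy records; in the report the column is mydata$death
toy <- data.frame(death = c("0", "1", "2/21/2020", "0", "0"),
                  stringsAsFactors = FALSE)

# Any entry other than "0" (a "1" or a date of death) marks a death
toy$death[toy$death != "0"] <- "1"

# Convert the cleaned character column to numeric
toy$death <- as.numeric(toy$death)

# Death rate as a percentage of all records
toy_rate <- 100 * sum(toy$death) / nrow(toy)
print(toy_rate)  # 40 for this toy data

# The report's own figures: 63 deaths in 1085 records
report_rate <- round(100 * 63 / 1085, 1)
print(report_rate)  # 5.8
```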
Group based data summarization
The data has been summarized by country using the table function (Table function in R, 2020) to understand its distribution. This gives a fair idea of how countries are represented in the dataset. China has the highest representation (197 records), followed by Japan (190), South Korea (114) and Hong Kong (94).
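A minimal sketch of the country summary, assuming a column named country as in the subset code later in the report. The toy rows are hypothetical, and sorting the counts is an addition for readability rather than a step the report describes.

```r
# Hypothetical toy records standing in for mydata
toy <- data.frame(country = c("China", "Japan", "China",
                              "South Korea", "Japan", "China"),
                  stringsAsFactors = FALSE)

# Count records per country and list the largest groups first
counts <- sort(table(toy$country), decreasing = TRUE)
print(counts)
```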
Data visualization
Turning to the main objective, the proportion of male and female cases, we found more cases among men: 60% of those infected are male and 40% are female. The plot was created using the barplot command (Barplot, 2020), with the distribution shown as proportions for a clearer picture.
Figure 2 Gender split for all data
Further, age has been plotted as a histogram with breaks at intervals of 10 years, using the hist function (Hist function, 2020) (The subset function, 2020).
Figure 3 Age split for all data
The data is consistent with our hypothesis that those over the age of 50 are at a higher risk.
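The proportions behind a gender bar chart can be sketched as follows. The gender values are hypothetical toy data chosen to mirror a 60/40 split; they are not taken from the dataset.

```r
# Hypothetical toy gender values
gender <- c("male", "female", "male", "male", "female")

# Counts per gender, converted to shares of the total
shares <- prop.table(table(gender))
print(round(100 * shares))  # female 40, male 60
```

Passing shares to barplot() draws the corresponding chart.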
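The binning used for the age histogram can be inspected without drawing it. This sketch uses hypothetical toy ages and the same 10-year breaks; the share above 50 is an illustrative extra, not a figure from the report.

```r
# Hypothetical toy ages
ages <- c(25, 34, 52, 58, 61, 67, 73, 45)

# Same 10-year breaks as the report's hist() call; plot = FALSE returns the bins only
h <- hist(ages, breaks = seq(0, 100, by = 10), plot = FALSE)
print(h$counts)  # 0 0 1 1 1 2 2 1 0 0

# Share of toy cases above 50 years of age
share_over_50 <- mean(ages > 50)
print(share_over_50)  # 0.625
```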
Data subset selection
A data subset has been created to understand whether the observations on gender and age for all nations apply to Australia as well. The subset function has been used for this (The subset function, 2020).
Figure 4 Gender split for Australia
Figure 5 Age split for Australia
The data visualizations show that the worldwide observations not only apply to Australia but are more pronounced there. However, no conclusion can be drawn, as the Australian subset contains fewer than 30 records.
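A minimal sketch of the subset step with a sample-size check, using hypothetical toy records. The threshold of 30 comes from the text; the message wording is illustrative.

```r
# Hypothetical toy records standing in for mydata
toy <- data.frame(
  country = c("Australia", "China", "Australia", "Japan"),
  gender  = c("male", "female", "male", "male"),
  age     = c(63, 45, 58, 30),
  stringsAsFactors = FALSE
)

# Keep only the Australian records
aus <- subset(toy, country == "Australia")
n_aus <- nrow(aus)
print(n_aus)  # 2

# Flag subsets too small to support a conclusion
if (n_aus < 30) {
  message("Fewer than 30 records: treat the plots as descriptive only")
}
```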
Exploratory visualization
Further, after cleaning the symptoms data, it was identified that a few symptoms occurred more frequently than others. The ggplot2 package (ggplot2, 2020) has been used to plot the most common symptoms.
Figure 6 Most common symptoms using ggplot2
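Counting how many cases mention each symptom can be sketched in base R. The symptom strings and the shortened term list are hypothetical toy data. The count is taken over one string per case, because grep on a single pasted string can report at most one match.

```r
# Hypothetical toy symptom descriptions, one entry per case
symptoms <- c("fever, cough", "fever", "sore throat, cough", "", "fever, headache")

# Terms to look for (a shortened, illustrative list)
terms <- c("fever", "cough", "throat", "headache")

# Number of cases whose description mentions each term
freq <- sapply(terms, function(term) sum(grepl(term, symptoms, fixed = TRUE)))

# Frequency table, most common symptom first
symptomsfreq <- data.frame(name = terms, freq = freq)
print(symptomsfreq[order(-symptomsfreq$freq), ])
```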
Results and Discussion
Through data exploration we identified the split of this random dataset between countries, gender and age. We also identified that if grouped by countries, a conclusive statement can be made only for a few countries due to lack of data. There are a few key points that emerged from the analysis about COVID-19 –
- Population at higher risk – Male, above 50 years of age
- Death rate – 5.8%
- Most common symptoms – Fever and cough
Conclusions
Data understanding and cleaning, followed by data conversion, subset formation, grouping and visualization, resulted in some actionable insights. These insights have wider implications for everyday life. We have determined that the death rate for the disease is high, which implies that without proper planning and preparation we may lose many lives. Further, elderly men should be extra cautious, and governments can take measures to restrict the movement of this category of the population. There should be new services aimed at making their lives easier and limiting their interaction with the outside world. Also, those exhibiting the common symptoms of COVID-19, i.e. fever and cough, should be tested and isolated to prevent further spread of the disease.
References
R Installation and Administration. (2020). Retrieved from CRAN R Project: https://cran.r-project.org/doc/manuals/r-release/R-admin.html
str function. (2020). Retrieved from R documentation: https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/str
lapply function. (2020). Retrieved from R documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/lapply
Statistic Globe. (2020). Retrieved from Factors to Numbers: https://statisticsglobe.com/how-to-convert-a-factor-to-numeric-in-r/
Text processing in R. (2020). Retrieved from https://www.mjdenny.com/Text_Processing_In_R.html
Table function in R. (2020). Retrieved from Data science made simple: http://www.datasciencemadesimple.com/table-function-in-r/
Barplot. (2020). Retrieved from R documentation: https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/barplot
Hist function. (2020). Retrieved from R documentation: https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist
The subset function. (2020). Retrieved from R Bloggers: https://www.r-bloggers.com/r-101-the-subset-function/
ggplot2. (2020). Retrieved from Tidyverse: https://ggplot2.tidyverse.org/
Figure 7 str(mydata)
Figure 9 Type conversion- lapply
#check RStudio version (RStudio.Version() is only available inside RStudio)
RStudio.Version()
#load csv file
mydata <- read.csv(file.choose())
#check data
colnames(mydata)
#check data types
str(mydata)
#convert column 18 (the symptom column described under Type conversion) to character
mydata[18] <- lapply(mydata[18], as.character)
mydata[18]
#plot of gender
barplot(prop.table(table(mydata[7])))
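#convert age data to numeric; go through character first, because
#as.numeric on a factor returns level codes instead of the recorded ages
mydata[[8]] <- as.numeric(as.character(mydata[[8]]))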
#plot of age
hist( mydata$age,
breaks = c(0,10,20,30,40,50,60,70,80,90,100))
#subset country data for Australia
ausdata<-subset(mydata, mydata$country=="Australia")
ausdata
#plot of gender for Australia
barplot(prop.table(table(ausdata[7])))
#plot of age for Australia
hist( ausdata$age,
breaks = c(0,10,20,30,40,50,60,70,80,90,100))
#load libraries for converting unstructured text data to more structured form
library(dplyr) # Data wrangling & manipulation
library(stringr) # For managing text
library(ggplot2) # For data visualizations & graphs
#create new list of symptoms
symptomslist<-c(mydata$symptom)
symptomslist
#clean text data (one string per case is kept so that cases can be counted)
symptomslist <- str_replace_all(symptomslist, pattern = '\"', replacement = "") # Remove double quotes
symptomslist <- str_replace_all(symptomslist, pattern = '\n', replacement = "") # Remove \n
symptomslist <- str_replace_all(symptomslist, pattern = '\u0092', replacement = "'") #Replace with quote
symptomslist <- str_replace_all(symptomslist, pattern = '\u0091', replacement = "'") #Replace with quote
#create new vectors with symptom name and frequency
name<-c('fever','cough','throat','pain','headache','diarrhea','chills','breath','runny nose')
#count the cases that mention each symptom; grep on a single pasted string
#could return at most one match, so the vector is not collapsed
freq <- vapply(name,
function(term) length(grep(term, symptomslist, fixed = TRUE)),
integer(1), USE.NAMES = FALSE)
#create a dataframe of symptom frequencies
symptomsfreq<-data.frame(name,freq)
#ggplot to determine most common symptoms
ggplot(data=symptomsfreq, aes(x=freq,y=name))+geom_bar(stat="identity")